Methods and systems for analyzing image data

ABSTRACT

Methods and systems for analysis of image data generated from various reference points. Particularly, the methods and systems provided are useful for real time analysis of image and sequence data generated during DNA sequencing methodologies.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 16/899,716, filed on Jun. 12, 2020, which is a continuation of U.S. patent application Ser. No. 15/153,953, filed May 13, 2016, which issued as U.S. Pat. No. 10,689,696, which is a 371 International Application of PCT/US2014/068409, filed Dec. 3, 2014, which claims the benefit of and priority to Provisional Application No. 61/911,319, filed on Dec. 3, 2013; Provisional Application No. 61/915,455, filed on Dec. 12, 2013; and Provisional Application No. 61/915,426, filed on Dec. 12, 2013. Each of the aforementioned applications is hereby incorporated by reference in its entirety.

BACKGROUND

The analysis of image data presents a number of challenges, especially with respect to comparing images of an item or structure that are captured from different points of reference. One field that exemplifies many of these challenges is that of nucleic acid sequence analysis.

The detection of specific nucleic acid sequences present in a biological sample has a wide variety of applications, such as identifying and classifying microorganisms, diagnosing infectious diseases, detecting and characterizing genetic abnormalities, identifying genetic changes associated with cancer, studying genetic susceptibility to disease, and measuring response to various types of treatment. A valuable technique for detecting specific nucleic acid sequences in a biological sample is nucleic acid sequencing.

Nucleic acid sequencing methodology has evolved significantly from the chemical degradation methods used by Maxam and Gilbert and the strand elongation methods used by Sanger. Today, there are a number of different processes being employed to elucidate nucleic acid sequence. A particularly popular sequencing process is sequencing-by-synthesis. One reason for its popularity is that this technique can be easily applied to massively parallel sequencing projects. For example, using an automated platform, it is possible to carry out hundreds of thousands of sequencing reactions simultaneously. Sequencing-by-synthesis differs from the classic dideoxy sequencing approach in that, instead of generating a large number of sequences and then characterizing them at a later step, real time monitoring of the incorporation of each base into a growing chain is employed. Although this approach might be viewed as slow in the context of an individual sequencing reaction, it can be used for generating large amounts of sequence information in each sequencing cycle when hundreds of thousands to millions of reactions are performed in parallel. Despite these advantages, the vast size and quantity of sequence information obtained through such methods can limit the speed and quality of analysis of sequence data. Thus, there is a need for methods and systems which improve the speed and accuracy of analysis of nucleic acid sequencing data.

BRIEF SUMMARY

Provided herein are methods for evaluating the quality of a base call from a sequencing read. In some embodiments, the methods can comprise the steps of: calculating a set of predictor values for the base call; and then using the predictor values to look up a quality score in a quality table. In some embodiments, the sequencing read utilizes two-channel base calling. In some embodiments, the sequencing read utilizes one-channel base calling. In certain aspects, the quality table is generated using Phred scoring on a calibration data set, the calibration set being representative of run and sequence variability. In certain aspects, the predictor values are selected from the group consisting of: online overlap; purity; phasing; start5; hexamer score; motif accumulation; endiness; approximate homopolymer; intensity decay; penultimate chastity; and signal overlap with background (SOWB). In certain aspects, the set of predictor values comprises online overlap; purity; phasing; and start5. In certain aspects, the set of predictor values comprises hexamer score; and motif accumulation.

In certain aspects, the method further comprises the steps of: discounting unreliable quality scores at the end of each read; identifying reads where the second worst chastity in the first 25 base calls is below a pre-established threshold; and marking the reads as poor quality data. In certain aspects, the method further comprises using an algorithm to identify a threshold of reliability. In certain aspects, reliable base calls comprise q-values, or other values indicative of data quality or statistical significance, above the threshold and unreliable base calls comprise q-values, or other values indicative of data quality or statistical significance, below the threshold. In certain aspects, the algorithm comprises an End Anchored Maximal Scoring Segments (EAMSS) algorithm. In certain aspects, the algorithm uses a Hidden Markov Model that identifies shifts in the local distributions of quality scores.

Also provided herein is a system for evaluating the quality of a base call from a sequencing read, the system comprising: a processor; a storage capacity; and a program for evaluating the quality of a base call from a sequencing read, the program comprising instructions for: calculating a set of predictor values for the base call; and then using the predictor values to look up a quality score in a quality table. In certain aspects, the quality table is generated using Phred scoring on a calibration data set, the calibration set being representative of run and sequence variability. In certain aspects, the predictor values are selected from the group consisting of: online overlap; purity; phasing; start5; hexamer score; motif accumulation; endiness; approximate homopolymer; intensity decay; penultimate chastity; and signal overlap with background (SOWB). In certain aspects, the set of predictor values comprises online overlap; purity; phasing; and start5. In certain aspects, the set of predictor values comprises hexamer score; and motif accumulation.

In certain aspects, the system can further comprise instructions for: discounting unreliable quality scores at the end of each read; identifying reads where the second worst chastity in the first 25 base calls is below a pre-established threshold; and marking the reads as poor quality data. In certain aspects, the system further comprises instructions for using an algorithm to identify a threshold of reliability. In certain aspects, the reliable base calls comprise q-values, or other values indicative of data quality or statistical significance, above the threshold and unreliable base calls comprise q-values, or other values indicative of data quality or statistical significance, below the threshold. In certain aspects, the algorithm comprises an End Anchored Maximal Scoring Segments (EAMSS) algorithm. In certain aspects, the algorithm uses a Hidden Markov Model that identifies shifts in the local distributions of quality scores.

Also presented herein are methods and systems for generating a phasing-corrected intensity value. The methods can comprise: performing a plurality of cycles of a sequencing by synthesis reaction such that, at each cycle, a signal is generated indicative of incorporation of the same nucleotide into a plurality of identical polynucleotides, whereby a portion of the signal is noise associated with a nucleotide incorporated during a previous cycle; detecting the signal at each cycle, the signal having an intensity value; and correcting the intensity value for phasing by applying a first order phasing correction to the intensity value; wherein a new first order phasing correction is calculated for each cycle.

In some aspects, the first order phasing correction comprises subtracting an intensity value from the immediately previous cycle from the intensity value of the current cycle. The method can further comprise subtracting an intensity value from the immediately subsequent cycle from the intensity value of the current cycle. In some aspects, the phasing correction comprises: $I_{N}^{\mathrm{corrected}} = I_{N} - X \cdot I_{N-1} - Y \cdot I_{N+1}$, where $I_{N}$ denotes the intensity value at cycle N. In certain aspects, the values of X and/or Y are chosen to optimize a chastity determination. In certain aspects, the chastity determination comprises mean chastity. In certain aspects, the sequencing run can utilize one-channel, two-channel or four-channel base calling.
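By way of illustration only, the first order phasing correction and the per-cycle selection of X and Y can be sketched as follows. This is a minimal sketch and not the implementation of any embodiment: the function names, the grid of candidate weights, the clamping of negative intensities to zero, and the use of an exhaustive grid search are all assumptions introduced here. Chastity is computed as the highest intensity divided by the sum of the two highest intensities.

```python
from itertools import product


def correct_cycle(prev, cur, nxt, x, y):
    """Apply I_corrected = I_N - x * I_(N-1) - y * I_(N+1).

    prev, cur and nxt are lists of per-feature channel intensities
    for cycles N-1, N and N+1 (same features, same channel order).
    """
    return [
        [c - x * p - y * n for p, c, n in zip(p_row, c_row, n_row)]
        for p_row, c_row, n_row in zip(prev, cur, nxt)
    ]


def chastity(channels):
    """Highest intensity divided by the sum of the two highest.

    Negative values are clamped to zero first (an assumption of this sketch).
    """
    top, second = sorted((max(v, 0.0) for v in channels), reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def mean_chastity(features):
    return sum(chastity(f) for f in features) / len(features)


def fit_cycle_weights(prev, cur, nxt, grid):
    """Pick (x, y) for ONE cycle by maximizing mean chastity over a grid.

    Calling this once per cycle yields a new correction for each cycle.
    Ties resolve to the first candidate in grid order.
    """
    return max(
        product(grid, grid),
        key=lambda xy: mean_chastity(correct_cycle(prev, cur, nxt, *xy)),
    )


# Hypothetical single-feature, four-channel example.
prev = [[0.0, 80.0, 0.0, 0.0]]
cur = [[100.0, 10.0, 10.0, 0.0]]   # 10 units leaked from each neighbor cycle
nxt = [[0.0, 0.0, 160.0, 0.0]]
grid = [0.0, 0.0625, 0.125, 0.1875, 0.25]
x, y = fit_cycle_weights(prev, cur, nxt, grid)
corrected = correct_cycle(prev, cur, nxt, x, y)
```

In this toy example over-correction is not penalized, so a practical selection criterion would likely need additional safeguards.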

Also presented herein are systems for generating a phasing-corrected intensity value. The systems can comprise: a processor; a storage capacity; and a program for generating a phasing-corrected intensity value, the program comprising instructions for: performing a plurality of cycles of a sequencing by synthesis reaction such that, at each cycle, a signal is generated indicative of incorporation of the same nucleotide into a plurality of identical polynucleotides, whereby a portion of the signal is noise associated with a nucleotide incorporated during a previous cycle; detecting the signal at each cycle, the signal having an intensity value; and correcting the intensity value for phasing by applying a first order phasing correction to the intensity value; wherein a new first order phasing correction is calculated for each cycle.

In some aspects, the first order phasing correction comprises subtracting an intensity value from the immediately previous cycle from the intensity value of the current cycle. The method can further comprise subtracting an intensity value from the immediately subsequent cycle from the intensity value of the current cycle. In some aspects, the phasing correction comprises: $I_{N}^{\mathrm{corrected}} = I_{N} - X \cdot I_{N-1} - Y \cdot I_{N+1}$, where $I_{N}$ denotes the intensity value at cycle N. In certain aspects, the values of X and/or Y are chosen to optimize a chastity determination. In certain aspects, the chastity determination comprises mean chastity. In certain aspects, the sequencing run can utilize one-channel, two-channel or four-channel base calling.

Also presented herein are methods and systems for identifying a nucleotide base from sequencing data where two separate images are obtained of an array of features on a surface. In some embodiments, the method comprises: detecting the presence or absence of a signal in two different channels for each of a plurality of features on an array at a particular time, thereby generating a first set of intensity values and a second set of intensity values for each of the features, wherein the combination of intensity values in each of the two channels corresponds to one of four different nucleotide bases; fitting four Gaussian distributions to the intensity values, each distribution having a centroid; calculating a likelihood value that indicates the likelihood of a particular feature belonging to each of the four distributions; and selecting for each feature of said plurality of features the distribution having the highest likelihood value, wherein said distribution corresponds to the identity of the nucleotide base present at said particular feature.

Also presented herein is a system for evaluating the quality of a base call from a sequencing read, the system comprising: a processor; a storage capacity; and a program for identifying a nucleotide base, the program comprising instructions for: detecting the presence or absence of a signal in two different channels for each of a plurality of features on an array at a particular time, thereby generating a first set of intensity values and a second set of intensity values for each of the features, wherein the combination of intensity values in each of the two channels corresponds to one of four different nucleotide bases; fitting four Gaussian distributions to the intensity values, each distribution having a centroid; calculating a likelihood value that indicates the likelihood of a particular feature belonging to each of the four distributions; and selecting for each feature of said plurality of features the distribution having the highest likelihood value, wherein said distribution corresponds to the identity of the nucleotide base present at said particular feature.

Also presented herein is a method of identifying a nucleotide base, the method comprising: obtaining a first set of intensity values and a second set of intensity values for each of a plurality of features on an array, wherein the intensity value for each feature in one or both sets corresponds to the presence or absence of a particular nucleotide base out of four possible nucleotide bases at the feature; fitting four Gaussian distributions to the intensity values, each distribution having a centroid; calculating four likelihood values for each feature, wherein each likelihood value indicates the likelihood of a particular feature belonging to one of the four distributions; and selecting for each feature of said plurality of features the distribution having the highest of the four likelihood values, wherein the distribution corresponds to the identity of the nucleotide base present at the particular feature.
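As a non-limiting illustration, the likelihood calculation and selection steps can be sketched as shown below, assuming the four Gaussian distributions have already been fitted. The sketch assumes isotropic two-dimensional Gaussians, normalized intensities, and hypothetical centroid and variance values; the base-to-channel assignment follows the example of FIG. 1A. None of these names or numbers is taken from an actual embodiment.

```python
import math


def gaussian_log_likelihood(point, centroid, variance):
    """Log density of an isotropic 2-D Gaussian at `point`."""
    d2 = sum((p - c) ** 2 for p, c in zip(point, centroid))
    return -d2 / (2.0 * variance) - math.log(2.0 * math.pi * variance)


def call_base(point, distributions):
    """Return the base whose distribution gives the highest likelihood,
    together with the four log-likelihood values."""
    scores = {
        base: gaussian_log_likelihood(point, centroid, variance)
        for base, (centroid, variance) in distributions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical fitted distributions: base -> ((channel 1, channel 2), variance).
DISTRIBUTIONS = {
    "C": ((1.0, 0.0), 0.02),  # signal in channel 1 only
    "A": ((0.0, 1.0), 0.02),  # signal in channel 2 only
    "T": ((1.0, 1.0), 0.02),  # signal in both channels
    "G": ((0.0, 0.0), 0.02),  # dark
}

base, scores = call_base((0.9, 0.1), DISTRIBUTIONS)
```

Working in log-likelihoods leaves the ranking of the four distributions unchanged while avoiding numerical underflow for features far from every centroid.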

Also presented herein is a system for evaluating the quality of a base call from a sequencing read, the system comprising: a processor; a storage capacity; and a program for identifying a nucleotide base, the program comprising instructions for: obtaining a first set of intensity values and a second set of intensity values for each of a plurality of features on an array, wherein the intensity value for each feature in one or both sets corresponds to the presence or absence of a particular nucleotide base out of four possible nucleotide bases at the feature; fitting four Gaussian distributions to the intensity values, each distribution having a centroid; calculating four likelihood values for each feature, wherein each likelihood value indicates the likelihood of a particular feature belonging to one of the four distributions; and selecting for each feature of said plurality of features the distribution having the highest of the four likelihood values, wherein the distribution corresponds to the identity of the nucleotide base present at the particular feature.

In any of the methods and systems described above, certain aspects can include embodiments wherein fitting can comprise using one or more algorithms from the group consisting of: a k-means clustering algorithm, a k-means-like clustering algorithm, expectation maximization, and a histogram based method. In some aspects, fitting can comprise using an Expectation Maximization algorithm. In some aspects, the method can comprise normalizing the intensity values. In certain aspects, a chastity value is calculated for each feature. In certain aspects, the chastity value is a function of the relative distance from a feature to the two nearest Gaussian centroids. In some aspects, features having a chastity value below a threshold value are filtered out.

The details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
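The text above states only that chastity is a function of the relative distance to the two nearest centroids; the particular function is not specified. The following sketch therefore assumes one possible form, 1 - d1 / (d1 + d2), where d1 and d2 are the distances to the nearest and second nearest centroid. That formula, the function names, the centroid coordinates, and the threshold of 0.6 are all hypothetical.

```python
import math


def centroid_chastity(point, centroids):
    """Assumed chastity: 1 - d1 / (d1 + d2) for the two nearest centroids.

    Equals 1.0 on top of a centroid and 0.5 when equidistant from two.
    """
    d1, d2 = sorted(math.dist(point, c) for c in centroids)[:2]
    total = d1 + d2
    return 1.0 - d1 / total if total > 0 else 0.0


def filter_features(points, centroids, threshold=0.6):
    """Keep only features whose chastity is at or above the threshold."""
    return [p for p in points if centroid_chastity(p, centroids) >= threshold]


# Hypothetical normalized centroids for the four two-channel clusters.
CENTROIDS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0)]

kept = filter_features([(1.0, 0.0), (0.5, 0.0)], CENTROIDS)
```

A feature lying midway between two clusters receives the lowest possible value under this assumed form and is filtered out, which matches the stated intent of removing ambiguous features.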

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A and 1B depict intensity data for a two-channel system. FIG. 1A is a scatter plot showing raw intensities for a particular tile and a particular cycle, where the C nucleotide is represented by signal in channel 1 only, A nucleotide is represented by signal in channel 2 only, T nucleotide is represented by signal in both channels 1 and 2, and G nucleotide is “dark.” FIG. 1B shows phasing corrected intensities of the same data using a phasing correction according to one embodiment of the methods presented herein.

FIG. 2A depicts intensity data for a two-channel system which has been subjected to various phasing corrections. FIG. 2A depicts cycle 150 in which phasing is under-corrected.

FIG. 2B depicts intensity data for a two-channel system which has been subjected to various phasing corrections. FIG. 2B depicts cycle 150 in which phasing is optimally corrected.

FIG. 2C depicts intensity data for a two-channel system which has been subjected to various phasing corrections. FIG. 2C depicts cycle 150 in which phasing is overcorrected.

FIG. 3 shows an exemplary plot of image intensities from two channel sequencing.

FIG. 4 shows an approach to fit Gaussian distributions to two-channel intensity data, according to one embodiment.

FIG. 5A sets forth an application of Expectation Maximization to one-channel sequencing data.

FIG. 5B sets forth an application of Expectation Maximization to two-channel sequencing data.

FIG. 6 is a flow chart illustrating a method in accordance with an embodiment.

FIG. 7 is a flow chart illustrating a method in accordance with an embodiment.

FIG. 8 is a flow chart illustrating a method in accordance with an embodiment.

FIG. 9 is a flow chart illustrating a method in accordance with an embodiment.

FIG. 10 is a block diagram of a system in accordance with an embodiment.

DETAILED DESCRIPTION

The present application describes various methods and systems for carrying out the methods. Examples of some of the methods are described as a series of steps. However, it should be understood that embodiments are not limited to the particular steps and/or order of steps described herein. Steps may be omitted, steps may be modified, and/or other steps may be added. Moreover, steps described herein may be combined, steps may be performed simultaneously, steps may be performed concurrently, steps may be split into multiple sub-steps, steps may be performed in a different order, or steps (or a series of steps) may be re-performed in an iterative fashion. In addition, although different methods are set forth herein, it should be understood that the different methods (or steps of the different methods) may be combined in other embodiments.

The analysis of image data presents a number of challenges, especially with respect to comparing images of an item or structure that are captured from different points of reference. Most image analysis methodology employs, at least in part, steps for aligning multiple separate images with respect to each other based on characteristics or elements present in both images. Various embodiments of the compositions and methods disclosed herein improve upon previous methods for image analysis. Some previous methods for image analysis are set forth in U.S. Patent Application Publication No. 2012/0020537 filed on Jan. 13, 2011 and entitled, “DATA PROCESSING SYSTEM AND METHODS,” the content of which is incorporated herein by reference in its entirety. Embodiments described hereinafter are also described in U.S. Provisional Application No. 61/911,319, filed on Dec. 3, 2013, which is incorporated herein by reference in its entirety. One or more embodiments may also be used with embodiments described in U.S. Provisional Application No. 62/052,189, filed on Sep. 18, 2014, which is incorporated herein by reference in its entirety.

Recently, tools have been developed that acquire and analyze image data generated at different time points or perspectives. Some examples include tools for analysis of satellite imagery and molecular biology tools for sequencing and characterizing the molecular identity of a specimen. In any such system, acquiring and storing large numbers of high-quality images typically requires massive amounts of storage capacity. Additionally, once acquired and stored, the analysis of image data can become resource intensive and can interfere with processing capacity of other functions, such as ongoing acquisition and storage of additional image data. As such, methods and systems which improve the speed and accuracy of the acquisition and analysis of image data would be beneficial.

In the molecular biology field, one of the processes for nucleic acid sequencing in use is sequencing-by-synthesis. The technique can be applied to massively parallel sequencing projects. For example, by using an automated platform, it is possible to carry out hundreds of thousands of sequencing reactions simultaneously. Thus, one of the embodiments of the present invention relates to instruments and methods for acquiring, storing, and analyzing image data generated during nucleic acid sequencing.

Enormous gains in the amount of data that can be acquired and stored make streamlined image analysis methods even more beneficial. For example, the image analysis methods described herein permit both designers and end users to make efficient use of existing computer hardware. Accordingly, presented herein are methods and systems which reduce the computational burden of processing data in the face of rapidly increasing data output. For example, in the field of DNA sequencing, yields have scaled 15-fold over the course of a recent year, and can now reach hundreds of gigabases in a single run of a DNA sequencing device. If computational infrastructure requirements grew proportionately, large genome-scale experiments would remain out of reach to most researchers. Thus, the generation of more raw sequence data will increase the need for secondary analysis and data storage, making optimization of data transport and storage extremely valuable. Some embodiments of the methods and systems presented herein can reduce the time, hardware, networking, and laboratory infrastructure requirements needed to produce usable sequence data.

As used herein, a “feature” is an area of interest within a specimen or field of view. When used in connection with microarray devices or other molecular analytical devices, a feature refers to the area occupied by similar or identical molecules. For example, a feature can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence. In other embodiments, a feature can be any element or group of elements that occupy a physical area on a specimen. For example, a feature could be a parcel of land, a body of water or the like. When a feature is imaged, each feature will have some area. Thus, in many embodiments, a feature is not merely one pixel.

The distances between features can be described in any number of ways. In some embodiments, the distances between features can be described from the center of one feature to the center of another feature. In other embodiments, the distances can be described from the edge of one feature to the edge of another feature, or between the outer-most identifiable points of each feature. The edge of a feature can be described as the theoretical or actual physical boundary on a chip, or some point inside the boundary of the feature. In other embodiments, the distances can be described in relation to a fixed point on the specimen or in the image of the specimen.

Multiple copies of nucleic acids at a feature can be sequenced, for example, by providing a labeled nucleotide base to the array of molecules, thereby extending a primer hybridized to a nucleic acid within a feature so as to produce a signal corresponding to a feature comprising the nucleic acid. In preferred embodiments, the nucleic acids within a feature are identical or substantially identical to each other.

In some of the image analysis methods described herein, each image in the set of images includes color signals, wherein a different color corresponds to a different nucleotide base. In some aspects, each image of the set of images comprises signals having a single color selected from at least four different colors. In certain aspects, each image in the set of images comprises signals having a single color selected from four different colors.

With respect to certain four-channel methods described herein, nucleic acids can be sequenced by providing four different labeled nucleotide bases to the array of molecules so as to produce four different images, each image comprising signals having a single color, wherein the signal color is different for each of the four different images, thereby producing a cycle of four color images that corresponds to the four possible nucleotides present at a particular position in the nucleic acid. In certain aspects, such methods can further comprise providing additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.

With respect to certain two-channel methods described herein, nucleic acids can be sequenced utilizing methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the disclosure of which is incorporated herein by reference in its entirety. As a first example, a nucleic acid can be sequenced by providing a first nucleotide type that is detected in a first channel, a second nucleotide type that is detected in a second channel, a third nucleotide type that is detected in both the first and the second channel, and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel. In certain aspects, such methods can further comprise providing additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.
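The two-channel encoding described above can be illustrated with a deliberately simplified decoder. The sketch assumes normalized intensities and a fixed detection threshold, both of which are hypothetical; the assignment of bases to channels follows the example of FIG. 1A (C in channel 1 only, A in channel 2 only, T in both, G dark). A fixed threshold is used here only to make the encoding concrete and is not a substitute for the distribution fitting described elsewhere herein.

```python
# (detected in channel 1, detected in channel 2) -> nucleotide, per FIG. 1A.
TWO_CHANNEL_CODE = {
    (True, False): "C",
    (False, True): "A",
    (True, True): "T",
    (False, False): "G",  # unlabeled, "dark" in both channels
}


def decode_two_channel(ch1, ch2, threshold=0.5):
    """Map a pair of normalized channel intensities to a nucleotide.

    `threshold` is a hypothetical cutoff chosen for this example.
    """
    return TWO_CHANNEL_CODE[(ch1 > threshold, ch2 > threshold)]


# One feature observed over four cycles.
read = "".join(
    decode_two_channel(c1, c2)
    for c1, c2 in [(0.9, 0.1), (0.1, 0.8), (0.95, 0.9), (0.05, 0.02)]
)
```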

Quality Scoring

Quality scoring refers to the process of assigning a quality score to each base call. In some embodiments where four different nucleotides are detected using fewer than four different labels, base calling requires a different set of analytical approaches compared to systems using traditional four-label detection. As an example, SBS can be performed utilizing two-channel methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the content of which is incorporated herein by reference in its entirety. For example, in embodiments that make use of two-channel detection, base calling is performed by extracting image data from two images, rather than four. Because of the fundamental differences involved in two-channel base calling, traditional quality scoring approaches as applied to four channel base calling are not compatible with two-channel base call data. For example, the error profile presented by two-channel data is fundamentally different from the error profile of four-channel data. In view of these differences, a new approach for evaluating the quality of a base call is required.

Accordingly, presented herein are methods and systems for evaluating the quality of a base call from a sequencing read. In some embodiments, the sequencing read utilizes two-channel base calling. In some embodiments, the sequencing read utilizes one-channel base calling.

The quality score is typically quoted as QXX, where XX is the score, and it means that the particular call has a probability of error of $10^{-XX/10}$. For example, Q30 equates to an error rate of 1 in 1000, or 0.1%, and Q40 equates to an error rate of 1 in 10,000, or 0.01%.
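The relationship between a quality score and its error probability can be checked with a short calculation. The function names below are chosen for this example only.

```python
import math


def error_probability(q):
    """Probability of error for a quality score Q: 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def quality_score(p_error):
    """Inverse mapping: Q = -10 * log10(probability of error)."""
    return -10 * math.log10(p_error)


p30 = error_probability(30)   # about 0.001, i.e. 1 error in 1000
p40 = error_probability(40)   # about 0.0001, i.e. 1 error in 10,000
```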

In some embodiments, a quality table is generated using Phred scoring on a calibration data set, the calibration set being representative of run and sequence variability. Phred scoring is described in greater detail in U.S. Pat. No. 8,392,126 entitled, “METHOD AND SYSTEM FOR DETERMINING THE ACCURACY OF DNA BASE IDENTIFICATIONS,” the content of which is incorporated herein by reference in its entirety.

In some embodiments, the methods can comprise the steps of: (a) calculating a set of predictor values for the base call; and (b) using the predictor values to look up a quality score in a quality table. In certain embodiments, quality scoring is performed by calculating a set of predictors for each base call, and using those predictor values to look up the quality score in a quality table. In some embodiments, the quality table is generated using a modification of the Phred algorithm on a calibration data set representative of run and sequence variability. The predictor values for each base call can be any suitable aspect that may indicate or predict the quality of the base call in a given sequencing run. For example, some suitable predictors are set forth in U.S. Patent Application Publication No. 2012/0020537 filed on Jan. 13, 2011 and entitled, “DATA PROCESSING SYSTEM AND METHODS,” the content of which is incorporated herein by reference in its entirety. As described in greater detail hereinbelow, suitable predictor values can include, for example: online overlap; purity; phasing; start5; hexamer score; motif accumulation; endiness; approximate homopolymer; intensity decay; penultimate chastity; signal overlap with background (SOWB); and shifted purity G adjustment. Any suitable combination of the above predictor values can be used in the methods presented herein.
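Steps (a) and (b) can be pictured with the following simplified sketch, in which predictor values are binned and the bin is used as a key into a quality table built from calibration data. This is not the Phred algorithm or any modification of it as actually practiced: the two predictors chosen, the bin edges, the add-one smoothing, and all function names are assumptions made for illustration.

```python
import math
from bisect import bisect_right
from collections import defaultdict

# Hypothetical predictors and bin edges.
PREDICTOR_ORDER = ("purity", "phasing")
BIN_EDGES = {"purity": [0.6, 0.8, 0.95], "phasing": [0.01, 0.05]}


def bin_key(predictors):
    """Convert predictor values into a tuple of bin indices."""
    return tuple(
        bisect_right(BIN_EDGES[name], predictors[name]) for name in PREDICTOR_ORDER
    )


def build_quality_table(calibration):
    """Build bin -> Q from (predictors, is_error) calibration records.

    Q is derived from the smoothed empirical error rate in each bin.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for predictors, is_error in calibration:
        entry = counts[bin_key(predictors)]
        entry[0] += int(is_error)
        entry[1] += 1
    table = {}
    for key, (errors, total) in counts.items():
        rate = (errors + 1) / (total + 2)  # smoothing avoids log10(0)
        table[key] = round(-10 * math.log10(rate))
    return table


def lookup_quality(table, predictors, default=0):
    """Look up the quality score for one base call; unseen bins get `default`."""
    return table.get(bin_key(predictors), default)


# Hypothetical calibration set: 998 clean calls and 8 poor calls (4 wrong).
calibration = [({"purity": 0.97, "phasing": 0.005}, False)] * 998
calibration += [({"purity": 0.5, "phasing": 0.2}, i < 4) for i in range(8)]
table = build_quality_table(calibration)
q = lookup_quality(table, {"purity": 0.99, "phasing": 0.002})
```

Because the table is computed once from calibration data, scoring a base call at run time reduces to computing its predictors and performing a single lookup.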

In certain embodiments, the quality predictors used in the Phred algorithm include online overlap; purity; phasing; start5; hexamer score; motif accumulation; endiness; approximate homopolymer; intensity decay; penultimate chastity; and signal overlap with background (SOWB).

As used herein, “online overlap” refers to a measurement of the separation between the foreground called intensities and the background intensities. For example, in some embodiments, this score is a statistic measuring the signal to noise of the read up to the scored base call, and is weighted to account more for the last few base calls, although even the first base calls in the read have an influence.

As used herein, “purity” refers to a measurement that captures how reliable a base call is likely to be based only on the current cycle, and measures how significant the called base is when compared to the other three bases.

As used herein, “phasing” refers to a measurement of the noise carried over from the previous and the next cycles, which is essentially the sum of phasing and pre-phasing weights.

As used herein, “Start5” refers to a binary metric that captures the sample preparation fragmentation at the beginning of a read. For example, in an exemplary embodiment, this predictor can receive a binary score of “1” during the first 5 cycles, and “0” for every cycle thereafter.

As used herein, “hexamer score” refers to a measurement that examines hexamers and returns an enrichment factor that reflects how much the hexamer is enriched near sequence specific errors. For example, in some embodiments, this score associates a measure of sequencing difficulty to every six-base sequence and is applied starting at cycle 6 of the run. Thus, the values applied before cycle 6 are the mean value of the predictor when all hexamers are averaged together.

As used herein, “motif accumulation” refers to a measurement that maintains a cumulative sum of the Hexamer Score predictor, accounting for how difficult the sequence context has been in the prior cycles of the read. For example, in some embodiments, this score is the cumulative sum of the hexamer score and is intended to measure the overall difficulty of the sequencing read up to the scored base call.
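The hexamer score and motif accumulation predictors can be sketched as follows. The hexamer table, its values, and the mean score are hypothetical placeholders. The sketch further assumes that the hexamer scored at a given cycle is the six bases ending at that cycle, and that a hexamer missing from the table falls back to the mean score.

```python
from itertools import accumulate


def hexamer_scores(read, table, mean_score):
    """Per-cycle hexamer score.

    Cycles 1-5 receive the mean score; from cycle 6 onward the score
    is looked up for the six bases ending at the current cycle.
    """
    scores = []
    for cycle in range(1, len(read) + 1):
        if cycle < 6:
            scores.append(mean_score)
        else:
            scores.append(table.get(read[cycle - 6:cycle], mean_score))
    return scores


def motif_accumulation(scores):
    """Cumulative sum of the hexamer score up to each scored base call."""
    return list(accumulate(scores))


# Hypothetical enrichment table and mean value.
HEXAMER_TABLE = {"GGCGGA": 3.0}
scores = hexamer_scores("GGCGGAT", HEXAMER_TABLE, mean_score=1.0)
running = motif_accumulation(scores)
```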

As used herein, “endiness” refers to a measurement that tracks how close the read is to completion. For example, in some embodiments, this score is the reciprocal of the cycle number.
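The two cycle-based predictors defined above, Start5 and endiness, depend only on the cycle number and can be written directly from their definitions. Cycle numbering is assumed to start at 1; the function names are illustrative.

```python
def start5(cycle):
    """Binary score: 1 during the first 5 cycles, 0 for every cycle thereafter."""
    return 1 if cycle <= 5 else 0


def endiness(cycle):
    """Reciprocal of the (1-based) cycle number."""
    return 1.0 / cycle


predictors = [(c, start5(c), endiness(c)) for c in (1, 5, 6, 100)]
```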

As used herein, “approximate homopolymer” refers to a calculation of the number of consecutive identical base calls preceding a base call. In certain embodiments, the calculation can allow one exception, in order to identify problematic sequence contexts such as homopolymer runs and problematic motifs such as “GGCGG”.
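One possible reading of the approximate homopolymer calculation is sketched below. The details are assumptions: the reference base is taken to be the base immediately preceding the scored call, matching bases are counted walking backward, a single mismatch is tolerated without being counted, and the walk stops at the second mismatch.

```python
def approximate_homopolymer(read, index, max_exceptions=1):
    """Count identical base calls preceding read[index], allowing exceptions.

    `index` is 0-based; the count covers read[:index] walking backward.
    """
    if index == 0:
        return 0
    reference = read[index - 1]
    count = 0
    exceptions = 0
    for base in reversed(read[:index]):
        if base == reference:
            count += 1
        elif exceptions < max_exceptions:
            exceptions += 1  # tolerated mismatch, not counted
        else:
            break
    return count


# The call following the motif "GGCGG" sees four G calls with one exception.
run_length = approximate_homopolymer("GGCGGA", 5)
```

With `max_exceptions=0` the same function reduces to a strict homopolymer count, which for this example would stop at the C and return 2.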

As used herein, “intensity decay” refers to the identification of base calls that suffer loss of signal as sequencing progresses. For example, this can be done by comparing the brightest intensity at the current cycle to the brightest intensity at cycle 1.
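The comparison described above can be expressed, for example, as a ratio. The text does not specify how the two intensities are compared, so the use of a ratio, the function name, and the example values are assumptions of this sketch.

```python
def intensity_decay(intensities_by_cycle, cycle):
    """Ratio of the brightest intensity at `cycle` to that at cycle 1.

    `intensities_by_cycle` holds one list of channel intensities per cycle;
    `cycle` is 1-based. Values well below 1.0 indicate loss of signal.
    """
    first = max(intensities_by_cycle[0])
    current = max(intensities_by_cycle[cycle - 1])
    return current / first if first > 0 else 0.0


# Hypothetical four-channel intensities for one feature over three cycles.
feature = [
    [1000.0, 50.0, 40.0, 30.0],
    [60.0, 800.0, 45.0, 20.0],
    [30.0, 25.0, 500.0, 35.0],
]
decay = intensity_decay(feature, 3)
```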

As used herein, “penultimate chastity” refers to a measurement of early read quality in the first 25 bases based on the second worst chastity value. For example, in some embodiments, this score is related to the read quality, which is correlated with the overall level of quality in the first 25 cycles. This predictor is very similar to the criteria used to mark a read as filtered or unfiltered, and has the effect of making the quality scores agnostic as to whether all data from a run is analyzed or only the data passing filter. Chastity can be determined as the highest intensity value divided by the sum of the highest intensity value and the second highest intensity value, where the intensity values are obtained from four color channels. For example, in some embodiments, methods of quality evaluation can further include identifying reads where the second worst chastity in the first subset of base calls is below a threshold, and marking those reads as poor quality data. The first subset of base calls can be any suitable number of base calls which provides a sufficient For example, the subset can be the first 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 or greater than the first 25 base calls. This can be termed read filtering, such that in certain embodiments, clusters that meet this cutoff are referred to as having “passed filter”.
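Chastity, penultimate chastity, and read filtering as defined above can be sketched in a few lines. The threshold value of 0.6 and the function names are hypothetical; the subset length of 25 cycles follows the example given in the text.

```python
def chastity(channel_intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(channel_intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def penultimate_chastity(read_intensities, n_cycles=25):
    """Second worst chastity over the first `n_cycles` base calls."""
    values = sorted(chastity(c) for c in read_intensities[:n_cycles])
    if not values:
        raise ValueError("read has no cycles")
    return values[1] if len(values) > 1 else values[0]


def passes_filter(read_intensities, threshold=0.6, n_cycles=25):
    """A read passes when its second worst chastity is not below the threshold."""
    return penultimate_chastity(read_intensities, n_cycles) >= threshold


# Hypothetical reads: the first has two poor cycles, the second only one.
read_a = [[100, 0, 0, 0], [50, 50, 0, 0], [90, 10, 0, 0], [55, 45, 0, 0]]
read_b = [[100, 0, 0, 0], [50, 50, 0, 0], [90, 10, 0, 0], [90, 10, 0, 0]]
results = (passes_filter(read_a), passes_filter(read_b))
```

Using the second worst rather than the worst value means that a single poor cycle does not by itself cause a read to fail.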

As used herein, “signal overlap with background” (SOWB) refers to a measurement of the separation of the signal from the noise in previous and subsequent cycles. In a preferred embodiment, the measurement utilizes the 5 cycles immediately preceding and following the current cycle.

As used herein, “Shifted Purity G adjustment” refers to a measurement of the separation of the signal from the noise for the current base call only, while also accounting for G quenching effects. Due to an interaction between the dye and the DNA base incorporated in the previous cycle, the intensities in certain color channels may be decreased (quenched) in cycles following those cycles where a G nucleotide was incorporated.

After calculating quality scores, additional operations can optionally be performed. Thus, in some embodiments, the method for evaluating the quality of a base call further comprises discounting unreliable quality scores at the end of each read. In preferred embodiments, the step of discounting unreliable quality scores comprises using an algorithm to identify a threshold of reliability. In a more preferred embodiment, reliable base calls comprise q-values above the threshold and unreliable base calls comprise q-values below the threshold. An algorithm for determining a threshold of reliability can comprise the End Anchored Maximal Scoring Segments (EAMSS) algorithm, for example. As used herein, an “EAMSS algorithm” is an algorithm that identifies transition points where good and reliable base calls (with mostly high q-values) become unreliable base calls (with mostly low q-values). The identification of such transition points can be done, for example, using a Hidden Markov Model that identifies shifts in the local distributions of quality scores. Useful Hidden Markov Models are described, for example, in Lawrence R. Rabiner (February 1989). “A tutorial on Hidden Markov Models and selected applications in speech recognition”. Proceedings of the IEEE 77 (2): 257-286. doi: 10.1109/5.18626. However, it will be apparent to one of skill in the art that any suitable method of discounting unreliable quality scores may be employed. In a preferred embodiment, unreliable base calls can include base calls with a strong bias toward G base calls.
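As a rough illustration of locating such a transition point, the sketch below scores each base call and searches for the highest-scoring segment anchored at the end of the read. It is a simplified stand-in, not the Hidden Markov Model approach mentioned above, and every parameter (`low_cutoff`, `low_score`, `high_score`, `min_score`, `masked_q`) is an assumed value chosen for the example.

```python
def eamss_transition(qualities, low_cutoff=15, low_score=1,
                     high_score=-2, min_score=1):
    """Return the index where an unreliable tail begins, or None.

    Hypothetical sketch: walking back from the read end, low q-values add
    to a running score and high q-values subtract from it; the start of
    the best-scoring end-anchored segment is taken as the transition.
    """
    best_score = 0
    best_start = None
    running = 0
    for i in range(len(qualities) - 1, -1, -1):
        running += low_score if qualities[i] < low_cutoff else high_score
        if running >= min_score and running >= best_score:
            best_score = running
            best_start = i
    return best_start


def discount_tail(qualities, masked_q=2, **kwargs):
    """Replace q-values in the unreliable tail with `masked_q` (assumed value)."""
    start = eamss_transition(qualities, **kwargs)
    if start is None:
        return list(qualities)
    return list(qualities[:start]) + [masked_q] * (len(qualities) - start)
```

With these assumed scores, an isolated high q-value inside a mostly low-quality tail does not end the segment, which matches the “mostly low q-values” wording of the definition.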

Real Time Metrics

The methods and systems provided herein can also utilize real-time metrics to display run quality to a user. Metrics can be displayed as graphs, charts, tables, pictures or any other suitable display method that provides a meaningful or useful representation of some aspect of run quality to a user. For example, real-time metrics displayed to a user can include a display of intensity values over the cycles of a run, the quality of the focus of optical equipment and cluster density in each lane. Additional metrics displays can include Q score, shown as a distribution based on the Q score, or as a heat map on a per cycle basis, for example. In some embodiments, real time metrics can include a summary table of various parameters, sorted by, for example, lane, tile, or cycle number. Image data from an entire tile or subregion of a tile may be displayed for a visual confirmation of image quality. Such image data may include close-up, thumbnail images of some or all parts of an image.

Additionally, some metrics displays can include the error rate on a per-cycle basis. The error rate can be calculated using a control nucleic acid.
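A per-cycle error rate of this kind can be sketched as below, assuming that reads from the control nucleic acid have already been aligned so that position i of each read corresponds to position i of a known reference sequence. The function name and that alignment assumption are illustrative and are not taken from the text.

```python
def per_cycle_error_rate(control_reads, reference):
    """Fraction of control reads whose call disagrees with the reference
    at each cycle.

    Hypothetical sketch: reads are assumed to be pre-aligned to
    `reference`, and reads shorter than it contribute only to the
    cycles they cover.
    """
    n_cycles = len(reference)
    errors = [0] * n_cycles
    counts = [0] * n_cycles
    for read in control_reads:
        for cycle, (called, expected) in enumerate(zip(read, reference)):
            counts[cycle] += 1
            if called != expected:
                errors[cycle] += 1
    return [e / c if c else None for e, c in zip(errors, counts)]
```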

Sequencing Methods

The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleotide base type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process. Preferred embodiments include sequencing-by-synthesis (“SBS”) techniques.

SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.

SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).

SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).

Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) “Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) “Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Pat. Nos. 6,210,891; 6,258,568 and 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g. A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.

In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.

Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially and an image of the array can be obtained between each addition step. In such embodiments each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features will be present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.

In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3′ ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3′ allyl group to block extension, but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. Nos. 7,427,673, and 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.

Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.

Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the disclosure of which is incorporated herein by reference in its entirety. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label or is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
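The exemplary two-channel configuration above amounts to a lookup on which channels show signal. The sketch below follows the dATP/dCTP/dTTP/dGTP assignment given in that example. The function name, the normalized intensity scale and the `threshold` value are assumptions for illustration.

```python
# Channel pattern (first channel on, second channel on) -> base,
# following the exemplary dATP / dCTP / dTTP / dGTP assignment in the text.
TWO_CHANNEL_CODE = {
    (True, False): "A",   # detected in the first channel only
    (False, True): "C",   # detected in the second channel only
    (True, True): "T",    # detected in both channels
    (False, False): "G",  # unlabeled: not, or minimally, detected
}


def call_two_channel(ch1, ch2, threshold=0.5):
    """Call a base from two channel intensities.

    Hypothetical sketch: intensities are assumed to be normalized so that
    a single `threshold` separates signal from background in each channel.
    """
    return TWO_CHANNEL_CODE[(ch1 >= threshold, ch2 >= threshold)]
```

A practical base caller would more likely classify each feature against intensity clusters rather than a fixed cutoff. The threshold is used here only to keep the example short.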

Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
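For the one-dye scheme, the nucleotide type at each feature follows from whether the feature shows signal in the first image, the second image, both, or neither. A minimal sketch under assumed names is given below. The inputs are per-feature intensities taken from the two images, listed in the same feature order, and `threshold` is an assumed cutoff.

```python
def decode_one_dye(first_image, second_image, threshold=0.5):
    """Return the nucleotide type (1-4, numbered as in the text) per feature.

    Hypothetical sketch: `first_image` and `second_image` hold one
    normalized intensity per feature, in the same feature order.
    """
    types = []
    for before, after in zip(first_image, second_image):
        in_first = before >= threshold
        in_second = after >= threshold
        if in_first and not in_second:
            types.append(1)   # label removed after the first image
        elif in_second and not in_first:
            types.append(2)   # labeled only after the first image
        elif in_first and in_second:
            types.append(3)   # label retained in both images
        else:
            types.append(4)   # unlabeled in both images
    return types
```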

Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features will be present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. Nos. 6,969,488, 6,172,218, and 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.

Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. “Nanopores and nucleic acids: prospects for ultrarapid sequencing.” Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, “Characterization of nucleic acids by nanopore analysis”. Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, “DNA molecules and configurations in a solid-state nanopore microscope” Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. “Progress toward ultrafast DNA sequencing using solid-state nanopores.” Clin. Chem. 53, 1996-2001 (2007); Healy, K. “Nanopore-based single-molecule DNA analysis.” Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. “A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution.” J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.

Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. Nos. 7,329,492 and 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. “Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al. “Parallel confocal detection of single molecules in real time.” Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures.” Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.

The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.

The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.

It will be appreciated that any of the above-described sequencing processes can be incorporated into the methods and/or systems described herein. Furthermore, it will be appreciated that other known sequencing processes can be easily implemented for use with the methods and/or systems described herein. It will also be appreciated that the methods and systems described herein are designed to be applicable with any nucleic acid sequencing technology. Additionally, it will be appreciated that the methods and systems described herein have even wider applicability to any field where tracking and analysis of features in a specimen over time or from different perspectives is important. For example, the methods and systems described herein can be applied where image data obtained by surveillance, aerial or satellite imaging technologies and the like is acquired at different time points or perspectives and analyzed.

Systems

A system capable of carrying out a method set forth herein, whether integrated with detection capabilities or not, can include a system controller that is capable of executing a set of instructions to perform one or more steps of a method, technique or process set forth herein. For example, the instructions can direct the performance of steps for creating a set of amplicons in situ. Optionally, the instructions can further direct the performance of steps for detecting nucleic acids using methods set forth previously herein. A useful system controller may include any processor-based or microprocessor-based system, including systems using microcontrollers, reduced instruction set computers (RISC), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), logic circuits, and any other circuit or processor capable of executing functions described herein. A set of instructions for a system controller may be in the form of a software program. As used herein, the terms “software” and “firmware” are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory. The software may be in various forms such as system software or application software. Further, the software may be in the form of a collection of separate programs, or a program module within a larger program or a portion of a program module. The software also may include modular programming in the form of object-oriented programming.

Throughout this application various publications, patents and/or patent applications have been referenced. The disclosure of these publications in their entireties is hereby incorporated by reference in this application.

The term “comprising” is intended herein to be open-ended, including not only the recited elements, but further encompassing any additional elements.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made. Accordingly, other embodiments are within the scope of the following claims.

The following description is with respect to FIGS. 1A, 1B, and 2. Embodiments described hereinafter are also described in U.S. Provisional Application No. 61/915,455, filed on Dec. 12, 2013, which is incorporated herein by reference in its entirety.

The analysis of image data presents a number of challenges, especially with respect to comparing images of an item or structure that are captured from different points of reference. Most image analysis methodology employs, at least in part, steps for aligning multiple separate images with respect to each other based on characteristics or elements present in both images. Various embodiments of the compositions and methods disclosed herein improve upon previous methods for image analysis. Some previous methods for image analysis are set forth in U.S. Patent Application Publication No. 2012/0020537 filed on Jan. 13, 2011 and entitled, “DATA PROCESSING SYSTEM AND METHODS,” the content of which is incorporated herein by reference in its entirety.

Recently, tools have been developed that acquire and analyze image data generated at different time points or perspectives. Some examples include tools for analysis of satellite imagery and molecular biology tools for sequencing and characterizing the molecular identity of a specimen. In any such system, acquiring and storing large numbers of high-quality images typically requires massive amounts of storage capacity. Additionally, once acquired and stored, the analysis of image data can become resource intensive and can interfere with processing capacity of other functions, such as ongoing acquisition and storage of additional image data. As such, methods and systems which improve the speed and accuracy of the acquisition and analysis of image data would be beneficial.

In the molecular biology field, one of the processes for nucleic acid sequencing in use is sequencing-by-synthesis. The technique can be applied to massively parallel sequencing projects. For example, by using an automated platform, it is possible to carry out hundreds of thousands of sequencing reactions simultaneously. Thus, one of the embodiments of the present invention relates to instruments and methods for acquiring, storing, and analyzing image data generated during nucleic acid sequencing.

Enormous gains in the amount of data that can be acquired and stored make streamlined image analysis methods even more beneficial. For example, the image analysis methods described herein permit both designers and end users to make efficient use of existing computer hardware. Accordingly, presented herein are methods and systems which reduce the computational burden of processing data in the face of rapidly increasing data output. For example, in the field of DNA sequencing, yields have scaled 15-fold over the course of a recent year, and can now reach hundreds of gigabases in a single run of a DNA sequencing device. If computational infrastructure requirements grew proportionately, large genome-scale experiments would remain out of reach to most researchers. Thus, the generation of more raw sequence data will increase the need for secondary analysis and data storage, making optimization of data transport and storage extremely valuable. Some embodiments of the methods and systems presented herein can reduce the time, hardware, networking, and laboratory infrastructure requirements needed to produce usable sequence data.

As used herein, a “feature” is an area of interest within a specimen or field of view. When used in connection with microarray devices or other molecular analytical devices, a feature refers to the area occupied by similar or identical molecules. For example, a feature can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence. In other embodiments, a feature can be any element or group of elements that occupy a physical area on a specimen. For example, a feature could be a parcel of land, a body of water or the like. When a feature is imaged, each feature will have some area. Thus, in many embodiments, a feature is not merely one pixel.

The distances between features can be described in any number of ways. In some embodiments, the distances between features can be described from the center of one feature to the center of another feature. In other embodiments, the distances can be described from the edge of one feature to the edge of another feature, or between the outer-most identifiable points of each feature. The edge of a feature can be described as the theoretical or actual physical boundary on a chip, or some point inside the boundary of the feature. In other embodiments, the distances can be described in relation to a fixed point on the specimen or in the image of the specimen.

Multiple copies of nucleic acids at a feature can be sequenced, for example, by providing a labeled nucleotide base to the array of molecules, thereby extending a primer hybridized to a nucleic acid within a feature so as to produce a signal corresponding to a feature comprising the nucleic acid. In preferred embodiments, the nucleic acids within a feature are identical or substantially identical to each other.

In some of the image analysis methods described herein, each image in the set of images includes color signals, wherein a different color corresponds to a different nucleotide base. In some aspects, each image of the set of images comprises signals having a single color selected from at least four different colors. In certain aspects, each image in the set of images comprises signals having a single color selected from four different colors.

With respect to certain four-channel methods described herein, nucleic acids can be sequenced by providing four different labeled nucleotide bases to the array of molecules so as to produce four different images, each image comprising signals having a single color, wherein the signal color is different for each of the four different images, thereby producing a cycle of four color images that corresponds to the four possible nucleotides present at a particular position in the nucleic acid. In certain aspects, such methods can further comprise providing additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.
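In the simplest reading of the four-channel scheme, the base at a feature in a given cycle is the one whose color image shows the strongest signal at that feature. The sketch below illustrates this under assumed names. The channel-to-base order in `CHANNEL_BASES` is arbitrary and hypothetical, and no crosstalk or phasing correction is applied.

```python
# Assumed (hypothetical) assignment of the four color channels to bases.
CHANNEL_BASES = ("A", "C", "G", "T")


def call_base(channel_intensities):
    """Call the base whose channel is brightest for one feature in one cycle."""
    brightest = max(range(len(CHANNEL_BASES)),
                    key=lambda ch: channel_intensities[ch])
    return CHANNEL_BASES[brightest]


def call_read(cycles):
    """Call one base per cycle from a sequence of four-channel intensities."""
    return "".join(call_base(intensities) for intensities in cycles)
```

Each entry of `cycles` corresponds to one cycle of four color images sampled at a single feature, so a plurality of cycles yields a read.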

With respect to certain two-channel methods described herein, nucleic acids can be sequenced utilizing methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the disclosure of which is incorporated herein by reference in its entirety. For example, a nucleic acid can be sequenced by providing a first nucleotide type that is detected in a first channel, a second nucleotide type that is detected in a second channel, a third nucleotide type that is detected in both the first and the second channel and a fourth nucleotide type that lacks a label or is not, or is only minimally, detected in either channel. In certain aspects, such methods can further comprise providing additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.

Phasing Estimation

A phasing estimation is an analytical tool for reducing noise during multiple cycles of a sequencing run. For example, in any given cycle of a sequencing run, one or more molecules may become “phased.” As used herein, “phased”, “phasing” and like terms refer to the situation where a molecule at a feature falls at least one base behind other molecules at the same feature as a result of the feature being sequenced at a particular cycle. As used herein, “pre-phased”, “pre-phasing” and like terms refer to the situation where a molecule at a feature jumps at least one base ahead of other molecules at the same feature as a result of the feature being sequenced at a particular cycle. The effects of phasing and pre-phasing become more pronounced with higher phasing/prephasing rates and longer reads. Thus, in order to maintain accurate base calling over an extended number of cycles, it is important to correct for this phenomenon. The methods and systems presented herein provide a computational solution which surprisingly yields improved base calling over extended sequencing cycles compared to traditional phasing correction methods.

The methods and systems provided herein can assume that a fixed fraction of molecules at each feature become phased at each cycle, in the sense that those molecules fall one base behind in sequencing. Thus, in a preferred embodiment, a phasing estimation is performed to adjust the observed intensities in a way that reduces the noise created by phased molecules.

Traditional phasing correction can be performed by methods as described in the incorporated materials of U.S. Patent Application Publication No. 2012/0020537. As described therein, a traditional approach to phasing correction involves creating a phasing matrix to model phasing effects at any given cycle. This can be done, for example, by creating an N×N matrix where N is the total number of cycles. Then, to phase-correct intensities for a given cycle, the inverse of the phasing matrix is taken and the matrix row corresponding to the cycle is extracted. As a result, the vector of actual intensities for cycles 1 through N is the product of the inverse of the phasing matrix and the observed intensities for cycles 1 through N. As an example of such an approach, a phasing estimation is performed by calculating phasing and prephasing rates from the first 12 cycles of intensity data. Corrections derived from these rates are then applied to all cycles to improve basecalling error rates. Because phasing rates are estimated during the early part of a sequencing run, an inaccurate phasing rate estimation made during early cycles (e.g., during cycles 1-12) can potentially affect the data obtained during later cycles.
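As a rough illustration of the matrix-based approach described above, the following Python sketch builds an N×N phasing matrix from fixed per-cycle phasing and prephasing rates and recovers the actual intensities by solving the resulting linear system. The function names, the rate values and the simple three-outcome model (no extension, single-base extension, two-base extension) are assumptions made for this illustration; they are not taken from the incorporated publication.

```python
def phasing_matrix(n_cycles, phasing, prephasing):
    """Row t: fraction of molecules showing base 1..N after cycle t+1 (assumed model)."""
    advance = 1.0 - phasing - prephasing
    # dist[j] = fraction of molecules that have incorporated j bases so far
    dist = [1.0] + [0.0] * (n_cycles + 1)
    rows = []
    for _ in range(n_cycles):
        nxt = [0.0] * len(dist)
        for j, frac in enumerate(dist):
            nxt[j] += frac * phasing              # phased: no extension this cycle
            if j + 1 < len(nxt):
                nxt[j + 1] += frac * advance      # normal single-base extension
            if j + 2 < len(nxt):
                nxt[j + 2] += frac * prephasing   # pre-phased: one extra base
        dist = nxt
        rows.append(dist[1:n_cycles + 1])         # contributions of bases 1..N
    return rows


def apply_matrix(matrix, vector):
    """Observed intensities = phasing matrix times actual intensities."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def phase_correct(observed, phasing, prephasing):
    """Equivalent to multiplying the observed vector by the matrix inverse."""
    return solve(phasing_matrix(len(observed), phasing, prephasing), observed)
```

In this sketch a single channel is corrected at a time. Because the same two rates are reused for every cycle, any error in their early estimate propagates to all later cycles, which is the limitation noted above.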

For example, in traditional phasing correction methods, if the phasing rate estimation is off, basecall accuracy is affected for the entirety of a run and is not adjusted. This effect is enhanced when sequencing low diversity samples such as single amplicons. Thus, if phasing rates estimated during early cycles are based on a low diversity of bases, the rates may not accurately reflect phasing rates during later cycles of a sequencing run. Traditional phasing correction approaches are not effective in adjusting to changing phasing rates in later cycles. Additionally, traditional phasing correction approaches are not designed to estimate the phasing rate on 2 channel data.

Empirical Phasing Correction

Presented herein are improved methods of performing phasing correction. The methods described herein provide surprising advantages in comparison to the traditional phasing correction approaches described above. For example, the methods presented herein include determining phasing corrections as an ongoing analysis throughout a sequencing run. As a result of this approach, an inaccurate phasing rate estimation made during early cycles (e.g., during cycles 1-12) will not adversely affect later cycles.

Presented herein is a method of performing phasing correction comprising empirical analysis. The methods presented herein are an alternative to, or can supplement, traditional phasing correction analysis as described above. The methods presented herein are surprisingly effective when applied to, for example, 1-channel and 2-channel data.

In some embodiments, the methods comprise an empirical phasing correction. Particular embodiments employ the step of applying a first order phasing correction. For example, in some embodiments, the method comprises a first order phasing correction for a given cycle as defined by the following:

I(cycle)=I(cycle)−X*I(cycle−1)−Y*I(cycle+1)

where I represents intensity (the left-hand side being the corrected value for the cycle) and X and Y represent the phasing and prephasing weights calculated for this cycle. It will be understood that, utilizing this approach, if the correct values of X and Y are chosen, then the mean chastity (quality) of the intensity values is maximized. For example, it is possible to numerically optimize via a pattern search over X and Y to maximize the mean chastity. Once the X and Y values with maximal mean chastity are identified, the above correction can be applied and basecalling can follow directly.
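The first order correction and the search over X and Y can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the described software: chastity is assumed here to be the brightest channel divided by the sum of the two brightest channels, negative corrected intensities are clipped to zero before scoring, and the pattern search is a simple compass search started at X = Y = 0. All function names are hypothetical.

```python
def correct(prev, cur, nxt, x, y):
    """I(cycle) - X*I(cycle-1) - Y*I(cycle+1), applied channel by channel."""
    return [c - x * p - y * n for p, c, n in zip(prev, cur, nxt)]


def chastity(channels):
    """Assumed definition: brightest / (brightest + second brightest)."""
    top, second = sorted((max(v, 0.0) for v in channels), reverse=True)[:2]
    return top / (top + second) if top + second > 0 else 0.0


def mean_chastity(clusters, x, y):
    """clusters: list of (previous, current, next) per-channel intensity lists."""
    scores = [chastity(correct(p, c, n, x, y)) for p, c, n in clusters]
    return sum(scores) / len(scores)


def pattern_search(clusters, step=0.1, tol=1e-4):
    """Compass search over non-negative X and Y maximizing mean chastity."""
    x = y = 0.0
    best = mean_chastity(clusters, x, y)
    while step > tol:
        improved = False
        for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            cx, cy = x + dx, y + dy
            if cx < 0 or cy < 0:
                continue  # weights are assumed to be non-negative
            score = mean_chastity(clusters, cx, cy)
            if score > best:
                x, y, best = cx, cy, score
                improved = True
        if not improved:
            step /= 2  # no better neighbor: refine the search mesh
    return x, y, best
```

Running the search once per tile and per cycle would give each tile its own X and Y, in keeping with the per-tile, per-cycle embodiments described below.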

In some embodiments, a separate phasing correction is calculated more than once during a sequencing run. For example, in some embodiments, a separate phasing correction is calculated 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, or more than 100 times during a sequencing run. In some embodiments, a phasing correction is calculated at nearly every cycle during a sequencing run. In some embodiments, a phasing correction is calculated at every cycle during a sequencing run.

In some embodiments, a separate phasing correction is calculated for different locations of an imaged surface at the same cycle. For example, in some embodiments, a separate phasing correction is calculated for every individual lane of an imaged surface, such as an individual flow cell lane. In some embodiments, a separate phasing correction is calculated for every subset of a lane, such as an imaging swath within a flow cell lane. In some embodiments, a separate phasing correction is calculated for each individual image, such as, for example, every tile. In certain embodiments, a separate phasing correction is calculated for every tile at every cycle.

In particular embodiments, the approach described above for empirical phasing correction serves to optimize the phasing and prephasing corrections for each cycle and tile to maximize the mean chastity of the intensity data. The result is that the real time analysis (RTA) software is no longer dependent upon an accurate rate calculation, since the best correction is applied at every cycle. Instead, RTA performs cycle-by-cycle corrections that are analyzed at a later cycle, for example, cycle 25. This analysis gives a calculated rate that can be saved in a file and/or displayed in a user interface.

As set forth in FIG. 1, application of the above approach can result in a dramatic improvement in the resolution of base calling. FIG. 1A shows raw intensities for a particular tile and a particular cycle in a two-channel system where the C nucleotide is represented by signal in channel 1 only, the A nucleotide is represented by signal in channel 2 only, the T nucleotide is represented by signal in both channels 1 and 2, and the G nucleotide is “dark.” FIG. 1B shows phasing corrected intensities of the same data using the above-described phasing correction. As shown in FIG. 1B, application of the above-described phasing correction approach dramatically increases resolution of intensities assigned to each of the four bases. To aid in distinguishing the data points, data for the nucleotides may be indicated in different colors. For example, the A nucleotide data may be indicated in green, the C nucleotide may be indicated in black, the T nucleotide may be indicated in pink, and the G nucleotide may be indicated in blue.
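The two-channel encoding described for FIG. 1 can be illustrated with a short Python sketch. The fixed threshold and the function name are assumptions made purely for illustration; a practical base caller would more likely assign each cluster to one of four intensity populations rather than apply a single cutoff.

```python
def call_base(channel1, channel2, threshold=0.5):
    """Map a pair of phasing-corrected intensities to a base call.

    Encoding from the text: C = channel 1 only, A = channel 2 only,
    T = both channels, G = dark. The threshold is a hypothetical cutoff.
    """
    on1 = channel1 > threshold
    on2 = channel2 > threshold
    if on1 and on2:
        return "T"
    if on1:
        return "C"
    if on2:
        return "A"
    return "G"


# Example: four well-separated clusters after phasing correction.
calls = [call_base(1.0, 0.0), call_base(0.0, 1.0),
         call_base(0.9, 1.1), call_base(0.05, 0.02)]
```

Uncorrected phasing adds residual signal from neighboring cycles to both channels, which pushes points toward such a cutoff and blurs the four populations.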

In particular embodiments, due to the physics of phasing, as reads get longer, higher order terms can become more and more important in phasing correction. Thus, in particular embodiments, to correct for this, a second order empirical phasing correction can be calculated. For example, in some embodiments, the method comprises a second order phasing correction as defined by the following:

I(cycle)=−a*I(cycle−2)−A*I(cycle−1)+I(cycle)−B*I(cycle+1)−b*I(cycle+2)

where I represents intensity, A and B represent the first order terms, and a and b represent the second order terms of the phasing correction. In particular embodiments, the calculation is optimized over a, A, B, and b.
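A minimal Python sketch of the second order formula above is given here for a single channel. The function name is hypothetical, and treating cycles outside the run (before the first or after the last cycle) as contributing zero intensity is an assumption of this illustration.

```python
def second_order_correct(series, cycle, a, A, B, b):
    """Apply -a*I(c-2) - A*I(c-1) + I(c) - B*I(c+1) - b*I(c+2).

    series: observed intensities of one channel, indexed by cycle (0-based).
    Cycles outside the series are assumed to contribute zero.
    """
    def at(i):
        return series[i] if 0 <= i < len(series) else 0.0

    return (-a * at(cycle - 2) - A * at(cycle - 1) + at(cycle)
            - B * at(cycle + 1) - b * at(cycle + 2))


observed = [10.0, 20.0, 30.0, 40.0, 50.0]
# With a = b = 0 the expression reduces to the first order correction.
first_order = second_order_correct(observed, 2, 0.0, 0.2, 0.3, 0.0)
second_order = second_order_correct(observed, 2, 0.1, 0.2, 0.3, 0.4)
```

The four weights could be optimized with the same kind of search used for the first order weights, at the cost of a four-dimensional search space.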

In some embodiments, higher order terms can be used to correct for high phasing and/or prephasing rates. In particular embodiments, the higher the phasing and/or prephasing rates, the bigger the difference the higher order terms make. In particular embodiments, the higher the phasing and/or prephasing rates and the longer the read, the more important the higher order terms become.

The methods provided herein are superior and provide significant advantages over traditional phasing correction approaches. For example, unlike traditional methods, there is no requirement to accurately estimate a phasing rate in the first 10 cycles of a run. Further, unlike traditional methods, there is no requirement to aggregate phasing estimates across tiles to arrive at a phasing correction that is generalized across all tiles. In addition, unlike traditional methods where a phasing correction is derived and applied to all cycles, in the methods presented herein, cycle to cycle corrections are independent. Specifically, permanent error is not introduced into the phasing correction algorithm by a few cycles of bad data.

The methods presented herein are particularly unaffected by low diversity runs. For example, in sequencing runs where only one or a very few sequences are being determined, such as in single amplicon or in metagenomic applications, the phasing correction is not entirely dependent on the accuracy of a calculation made based on a limited set of early cycles, and instead can optimize phasing corrections for each tile and each cycle.

Although the methods and systems presented herein are exemplified primarily in the context of two-channel sequencing data, it should be appreciated that the same methods and algorithms can be directly applied to 4 channel data with substantially reduced error rates and increased alignment scores. An example of phasing correction calculations using 2 channel data is presented below as Example 1. An example of phasing correction calculations using 4 channel data is presented below as Example 2.

Sequencing Methods

The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleotide base type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process. Preferred embodiments include sequencing-by-synthesis (“SBS”) techniques.

SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleic acid in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.

SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used, as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible, as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).

SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).

Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) “Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) “Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Pat. Nos. 6,210,891; 6,258,568 and 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g. A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.

In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123744, each of which is incorporated herein by reference. The availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.

Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially and an image of the array can be obtained between each addition step. In such embodiments each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features will be present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.

In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3′ ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3′ allyl group to block extension, but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. Nos. 7,427,673, and 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.

Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.

Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the disclosure of which is incorporated herein by reference in its entirety. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label and is therefore not, or only minimally, detected in either channel (e.g. dGTP having no label).

Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.

Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features will be present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. Nos. 6,969,488, 6,172,218, and 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.

Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. “Nanopores and nucleic acids: prospects for ultrarapid sequencing.” Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, “Characterization of nucleic acids by nanopore analysis”. Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, “DNA molecules and configurations in a solid-state nanopore microscope” Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. “Progress toward ultrafast DNA sequencing using solid-state nanopores.” Clin. Chem. 53, 1996-2001 (2007); Healy, K. “Nanopore-based single-molecule DNA analysis.” Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. “A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution.” J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.

Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. Nos. 7,329,492 and 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. “Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al. “Parallel confocal detection of single molecules in real time.” Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures.” Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.

The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.

The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.

Systems

A system capable of carrying out a method set forth herein, whether integrated with detection capabilities or not, can include a system controller that is capable of executing a set of instructions to perform one or more steps of a method, technique or process set forth herein. For example, the instructions can direct the performance of steps for creating a set of amplicons in situ. Optionally, the instructions can further direct the performance of steps for detecting nucleic acids using methods set forth previously herein. A useful system controller may include any processor-based or microprocessor-based system, including systems using microcontrollers, reduced instruction set computers (RISC), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), logic circuits, and any other circuit or processor capable of executing functions described herein. A set of instructions for a system controller may be in the form of a software program. As used herein, the terms “software” and “firmware” are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory. The software may be in various forms such as system software or application software. Further, the software may be in the form of a collection of separate programs, or a program module within a larger program or a portion of a program module. The software also may include modular programming in the form of object-oriented programming.

It will be appreciated that any of the above-described sequencing processes can be incorporated into the methods and/or systems described herein. Furthermore, it will be appreciated that other known sequencing processes can be easily implemented for use with the methods and/or systems described herein. It will also be appreciated that the methods and systems described herein are designed to be applicable with any nucleic acid sequencing technology. Additionally, it will be appreciated that the methods and systems described herein have even wider applicability to any field where tracking and analysis of features in a specimen over time or from different perspectives is important. For example, the methods and systems described herein can be applied where image data obtained by surveillance, aerial or satellite imaging technologies and the like is acquired at different time points or perspectives and analyzed.

EXAMPLES

Example 1 Empirical Phasing Correction on 2 Channel Data

Empirical phasing was implemented in a 2 channel sequencing system running whole genome sequencing of human samples. FIG. 1 shows representative data from a particular tile and a particular cycle. Specifically, as shown in FIG. 1B, using the phasing correction method described below dramatically increases the resolution of the intensities assigned to each of the four bases.

The fundamental idea of the empirical correction algorithm is that the correct phasing correction maximizes the cumulative chastity of the data. Using the correction algorithm described above, it is possible to iterate over all phasing correction values and establish which gives the best results. FIGS. 2A, 2B, and 2C depict intensity data for a two-channel system which has been subjected to various phasing corrections. FIG. 2A illustrates cycle 150 from the sequencing run, where phasing is under-corrected. FIG. 2B shows optimally corrected data. FIG. 2C shows overcorrected data. Clearly, the mean chastity of the data is maximized when the assumed phasing rate is the true value.

This knowledge can be leveraged to estimate a phasing and pre-phasing correction parameter at every cycle which maximizes the chastity for that cycle. To accomplish this, a first order phasing correction is implemented:

I(cycle)=I(cycle)−A*I(cycle−1)−B*I(cycle+1)

Normally, the constants A and B are calculated from the estimated phasing/pre-phasing rates and weighted by the cycle number. In an embodiment using empirical phasing correction, the method can optimize over A and B at every cycle using a pattern search. The cost function is the number of clusters that fail a chastity filter. Thus, A and B are selected to maximize the data quality.
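
The per-cycle optimization over A and B can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the example: an exhaustive search over a small candidate grid stands in for the pattern search, and `count_failures` is a hypothetical toy proxy for the chastity filter (a cluster fails if any channel lies more than `tol` from an ideal on/off state). All function names and the synthetic data are illustrative.

```python
import itertools
import numpy as np

def phasing_correct(prev, cur, nxt, a, b):
    """First-order correction: I(cycle) - A*I(cycle-1) - B*I(cycle+1)."""
    return cur - a * prev - b * nxt

def count_failures(corrected, tol=0.02):
    """Toy stand-in for a chastity filter (an assumption, not the source's filter):
    a cluster fails if any channel is more than `tol` from an ideal 0/1 state."""
    deviation = np.abs(corrected - np.round(corrected))
    return int(np.sum(deviation.max(axis=1) > tol))

def search_a_b(prev, cur, nxt, candidates, cost=count_failures):
    """Exhaustive search over candidate (A, B) pairs; returns the lowest-cost pair."""
    best_cost, best_pair = None, None
    for a, b in itertools.product(candidates, repeat=2):
        c = cost(phasing_correct(prev, cur, nxt, a, b))
        if best_cost is None or c < best_cost:
            best_cost, best_pair = c, (float(a), float(b))
    return best_pair

# Synthetic two-channel data covering every on/off combination over three cycles.
states = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
combos = np.array(list(itertools.product(states, repeat=3)))  # shape (64, 3, 2)
prev, true_cur, nxt = combos[:, 0], combos[:, 1], combos[:, 2]
observed = true_cur + 0.10 * prev + 0.05 * nxt  # simulated phasing and pre-phasing

a, b = search_a_b(prev, observed, nxt, candidates=np.linspace(0.0, 0.2, 5))
print(a, b)
```

On this synthetic data the search recovers the simulated constants because only the true pair brings every cluster back within tolerance of an ideal state. A real cost function would count clusters failing the actual chastity filter.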

To minimize the computational cost of effectively correcting at many different phasing rates, then choosing the best one, the optimal A and B values at every cycle were saved in the following file:

\Data\Intensities\BaseCalls\Phasing\EmpiricalPhasingCorrection_lane_read_tile.txt.

These data files have the following structure:

Cycle PhasingCorrection PrephasingCorrection

To determine the phasing or pre-phasing rate, the list of PhasingCorrection was plotted by cycle. The phasing rate is the slope of the resulting line.
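
The slope estimate can be illustrated with a short least-squares fit. This sketch assumes the file rows are whitespace-separated with the three columns in the order given above. The parsing helper, the function names, and the numeric values are hypothetical and chosen only for illustration.

```python
import numpy as np

def parse_corrections(text):
    """Parse rows of: Cycle PhasingCorrection PrephasingCorrection.
    Whitespace-separated columns are an assumption about the file layout."""
    rows = [[float(v) for v in line.split()] for line in text.strip().splitlines()]
    data = np.array(rows)
    return data[:, 0], data[:, 1], data[:, 2]

def rate_from_slope(cycles, corrections):
    """Least-squares slope of correction value versus cycle number."""
    slope, _intercept = np.polyfit(cycles, corrections, 1)
    return slope

# Made-up values for illustration only.
example = """
1 0.002 0.001
2 0.004 0.002
3 0.006 0.003
4 0.008 0.004
"""
cycles, phasing, prephasing = parse_corrections(example)
phasing_rate = rate_from_slope(cycles, phasing)
prephasing_rate = rate_from_slope(cycles, prephasing)
print(phasing_rate, prephasing_rate)
```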

Example 2 Empirical Phasing Correction on Low Diversity 4 Channel Data

Four channel sequencing of low diversity samples such as single amplicons presents several challenges, including low throughput, low % PF, and low quality scores. Even when a known phage genome (PhiX) was spiked into the sample up to levels approaching 50%, these challenges persist.

A single amplicon sequencing run was performed utilizing empirical phasing correction to give high quality data under extremely low diversity conditions. In this experiment, 3 separate single amplicon runs were performed with paired end runs of 101 cycles from each end. A version of real time analysis software (RTA version 1.17.23) was used to analyze the four channel data. This RTA version included empirical phasing. In all experiments, all cluster densities were greater than 1000 k/mm² and the number of clusters passing filter was greater than 90%. All sequencing data had a percent quality score above Q30 of 93%. These results demonstrate that empirical phasing on low diversity sequencing data yields superior data quality.

Throughout this application various publications, patents and/or patent applications have been referenced. The disclosure of these publications in their entireties is hereby incorporated by reference in this application.

The term “comprising” is intended herein to be open-ended, including not only the recited elements, but further encompassing any additional elements.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made. Accordingly, other embodiments are within the scope of the following claims.

The following description is with respect to FIGS. 3-5. Embodiments described hereinafter are also described in U.S. Provisional Application No. 61/915,426, filed on Dec. 12, 2013, which is incorporated herein by reference in its entirety.

The analysis of image data presents a number of challenges, especially with respect to comparing images of an item or structure that are captured from different points of reference. Most image analysis methodology employs, at least in part, steps for aligning multiple separate images with respect to each other based on characteristics or elements present in both images. Various embodiments of the compositions and methods disclosed herein improve upon previous methods for image analysis. Some previous methods for image analysis are set forth in U.S. Patent Application Publication No. 2012/0020537 filed on Jan. 13, 2011 and entitled, “DATA PROCESSING SYSTEM AND METHODS,” the content of which is incorporated herein by reference in its entirety.

Recently, tools have been developed that acquire and analyze image data generated at different time points or perspectives. Some examples include tools for analysis of satellite imagery and molecular biology tools for sequencing and characterizing the molecular identity of a specimen. In any such system, acquiring and storing large numbers of high-quality images typically requires massive amounts of storage capacity. Additionally, once acquired and stored, the analysis of image data can become resource intensive and can interfere with processing capacity of other functions, such as ongoing acquisition and storage of additional image data. As such, methods and systems which improve the speed and accuracy of the acquisition and analysis of image data would be beneficial.

In the molecular biology field, one of the processes for nucleic acid sequencing in use is sequencing-by-synthesis. The technique can be applied to massively parallel sequencing projects. For example, by using an automated platform, it is possible to carry out hundreds of thousands of sequencing reactions simultaneously. Thus, one of the embodiments of the present invention relates to instruments and methods for acquiring, storing, and analyzing image data generated during nucleic acid sequencing.

Enormous gains in the amount of data that can be acquired and stored make streamlined image analysis methods even more beneficial. For example, the image analysis methods described herein permit both designers and end users to make efficient use of existing computer hardware. Accordingly, presented herein are methods and systems which reduce the computational burden of processing data in the face of rapidly increasing data output. For example, in the field of DNA sequencing, yields have scaled 15-fold over the course of a recent year, and can now reach hundreds of gigabases in a single run of a DNA sequencing device. If computational infrastructure requirements grew proportionately, large genome-scale experiments would remain out of reach to most researchers. Thus, the generation of more raw sequence data will increase the need for secondary analysis and data storage, making optimization of data transport and storage extremely valuable. Some embodiments of the methods and systems presented herein can reduce the time, hardware, networking, and laboratory infrastructure requirements needed to produce usable sequence data.

As used herein, a “feature” is an area of interest within a specimen or field of view. When used in connection with microarray devices or other molecular analytical devices, a feature refers to the area occupied by similar or identical molecules. For example, a feature can be an amplified oligonucleotide or any other group of a polynucleotide or polypeptide with a same or similar sequence. In other embodiments, a feature can be any element or group of elements that occupy a physical area on a specimen. For example, a feature could be a parcel of land, a body of water or the like. When a feature is imaged, each feature will have some area. Thus, in many embodiments, a feature is not merely one pixel.

The distances between features can be described in any number of ways. In some embodiments, the distances between features can be described from the center of one feature to the center of another feature. In other embodiments, the distances can be described from the edge of one feature to the edge of another feature, or between the outer-most identifiable points of each feature. The edge of a feature can be described as the theoretical or actual physical boundary on a chip, or some point inside the boundary of the feature. In other embodiments, the distances can be described in relation to a fixed point on the specimen or in the image of the specimen.

Multiple copies of nucleic acids at a feature can be sequenced, for example, by providing a labeled nucleotide base to the array of molecules, thereby extending a primer hybridized to a nucleic acid within a feature so as to produce a signal corresponding to a feature comprising the nucleic acid. In preferred embodiments, the nucleic acids within a feature are identical or substantially identical to each other.

In some of the image analysis methods described herein, each image in the set of images includes color signals, wherein a different color corresponds to a different nucleotide base. In some aspects, each image of the set of images comprises signals having a single color selected from at least four different colors. In certain aspects, each image in the set of images comprises signals having a single color selected from four different colors.

With respect to certain four-channel methods described herein, nucleic acids can be sequenced by providing four different labeled nucleotide bases to the array of molecules so as to produce four different images, each image comprising signals having a single color, wherein the signal color is different for each of the four different images, thereby producing a cycle of four color images that corresponds to the four possible nucleotides present at a particular position in the nucleic acid. In certain aspects, such methods can further comprise providing additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.

With respect to certain two-channel methods described herein, nucleic acids can be sequenced utilizing methods and systems described in U.S. Patent Application Publication No. 2013/0079232, the disclosure of which is incorporated herein by reference in its entirety. As a first example, a nucleic acid can be sequenced by providing a first nucleotide type that is detected in a first channel, a second nucleotide type that is detected in a second channel, a third nucleotide type that is detected in both the first and the second channel and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel. In certain aspects, such methods can further comprise providing additional labeled nucleotide bases to the array of molecules, thereby producing a plurality of cycles of color images.

Base Calling

Presented herein are methods and systems for identifying a nucleotide base in a nucleic acid sequence, or “base calling.” Base calling refers to the process of determining a base call (A, C, G, T) for every feature of a given tile at a specific cycle. As an example, SBS can be performed utilizing two-channel methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. For example, in embodiments that make use of two-channel detection, base calling is performed by extracting image data from two images, rather than four. Because of the fundamental differences involved in two-channel base calling, traditional base calling approaches as applied to four channel detection are not compatible with two-channel data. In view of these differences, a new approach for base calling is required. Accordingly, presented herein are methods and systems for base calling in a 2-channel system. In some embodiments, the methods comprise iteratively fitting four Gaussian distributions to intensity data from two channels. When signals from channel 1 are plotted against signals from channel 2, signal intensity typically segregates into four general populations of intensity. As shown in FIG. 3, data from a 2 channel sequencing system can be plotted as intensity values from channel 1 (x-axis) versus intensity values from channel 2 (y-axis). In typical embodiments, one of the four nucleotides is unlabeled (dark), such as the “G” nucleotide shown in FIG. 3, which has near zero signal in both channel 1 and channel 2. The signals from a certain portion of the data points are clustered near the zero point in each axis. Likewise, the signals from a certain portion of the data points labeled with one or both labels (shown as “C”, “A”, and “T” nucleotides in FIG. 3) form identifiable populations when plotted in a two-dimensional graph such as the one shown in FIG. 3. Thus, for example, unlike four-channel sequencing data, the intensity itself of a particular label does not encode the base. Rather, the combination of intensities, [on, off], [off, on], [on, on], [off, off], provides the encoding information for the base identity.
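
The on/off encoding can be illustrated with a small lookup. This is only a sketch: the label assignment follows one exemplary embodiment described elsewhere in this document (A in the first channel, C in the second, T in both, G unlabeled), and the fixed threshold is a hypothetical simplification. The methods presented herein separate the populations by fitting distributions rather than by thresholding.

```python
# Hypothetical label assignment taken from one exemplary embodiment in this
# document: A in channel 1 only, C in channel 2 only, T in both, G unlabeled.
ENCODING = {
    (True, False): "A",
    (False, True): "C",
    (True, True): "T",
    (False, False): "G",
}

def decode_base(ch1, ch2, threshold=0.5):
    """Map a pair of normalized intensities to a base via on/off states.

    The fixed threshold is an illustrative assumption; it is not how the
    populations are separated in the methods described in the text.
    """
    return ENCODING[(ch1 > threshold, ch2 > threshold)]

pairs = [(0.95, 0.03), (0.02, 0.98), (0.90, 1.10), (0.01, 0.04)]
calls = "".join(decode_base(x, y) for x, y in pairs)
print(calls)  # ACTG
```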

The methods and systems presented herein provide a tool for identifying the base associated with any one particular data point in such data sets. An objective of the methods and systems presented herein is to separate the four populations as accurately as possible.

Classifiers

In some embodiments presented herein, base calling is performed by fitting a mathematical model to a set of intensity data. Any suitable mathematical model can be used in the methods presented herein in order to fit the intensity data to a set of distributions. Mathematical models that can be used in the methods presented herein can include classifiers such as, for example, a k-means clustering algorithm, a k-means-like clustering algorithm, expectation maximization, a histogram based method, and the like.

For example, in certain embodiments, one or more Gaussian distributions are fitted to a set of intensity data. In certain embodiments, four Gaussian distributions are fit to a set of two-channel intensity data such that one distribution is applied for each of the four nucleotides represented in the data set. In certain embodiments, intensity values can be normalized prior to fitting a Gaussian distribution. For example, as shown in the exemplary embodiment represented by FIG. 4, intensity values are normalized so that the 5th and 95th percentiles have values of 0 and 1, respectively. Four Gaussian distributions are then fit to the data using an algorithm such as, for example, an expectation maximization (EM) clustering algorithm. EM algorithms are known in the art and are useful tools to construct statistical models of the underlying data source and naturally generalize to cluster databases containing both discrete-valued and continuous-valued data. Thus, for example, in certain embodiments, an EM algorithm is applied to iteratively maximize the likelihood of observing the given data. For example, an EM algorithm is applied to iteratively maximize this likelihood over the mean and covariance for each of the Gaussian distributions. In certain embodiments, a subset of the data points in a data set is included in the calculation. Additionally or alternatively, in certain embodiments, all or substantially all data points in the data set are included in the calculation.
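
The normalization and fitting steps can be sketched with plain NumPy. This is a minimal illustration, not the implementation behind FIG. 4: the initial means (corners of the unit square), the initial covariance, the iteration count, the small regularization term, and the synthetic data are all assumptions introduced here.

```python
import numpy as np

def normalize(intensities):
    """Rescale each channel so its 5th and 95th percentiles map to 0 and 1."""
    p5, p95 = np.percentile(intensities, [5, 95], axis=0)
    return (intensities - p5) / (p95 - p5)

def gaussian_pdf(x, mean, cov):
    """Density of a 2-D Gaussian evaluated at each row of x."""
    d = x - mean
    maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(cov), d)
    return np.exp(-0.5 * maha) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

def fit_em(x, init_means, n_iter=50, reg=1e-6):
    """Fit a mixture of Gaussians by expectation maximization."""
    means = np.array(init_means, dtype=float)
    k = len(means)
    covs = np.array([np.eye(2) * 0.05 for _ in range(k)])  # assumed starting spread
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each distribution for each data point.
        resp = np.stack(
            [w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs)],
            axis=1,
        )
        resp /= np.maximum(resp.sum(axis=1, keepdims=True), 1e-300)
        # M-step: update weights, means, and covariances.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp.T @ x) / nk[:, None]
        for j in range(k):
            d = x - means[j]
            covs[j] = (resp[:, j, None] * d).T @ d / nk[j] + reg * np.eye(2)
    return weights, means, covs

# Synthetic two-channel intensities: four populations with arbitrary scale/offset.
rng = np.random.default_rng(0)
corners = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
raw = np.vstack([100 + 1000 * c + rng.normal(0, 30, size=(250, 2)) for c in corners])

norm = normalize(raw)
weights, means, covs = fit_em(norm, init_means=corners)
print(np.round(means, 2))
```

Starting the means at the corners suits a square layout of populations. A triangular layout would call for different starting points, or for a library mixture-model routine with its own initialization.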

As a result of the EM algorithm, for each X, Y value (referring to each of the two channel intensities respectively) a value can be generated which represents the likelihood that a certain X, Y intensity value belongs to one of the four distributions. In an embodiment where four bases give four separate distributions, each X, Y intensity value will also have four associated likelihood values, one for each of the four bases. The maximum of the four likelihood values indicates the base call. In FIGS. 5A and 5B, intensity values for a two-channel data set are assigned a base call after performing a Gaussian fit to the data set. Each data point in FIGS. 5A and 5B has a color associated with the assigned base call, which represents the maximum of the likelihood prediction values. A comparison of the base call data shown in FIGS. 5A and 5B indicates that the base calling methods presented herein are highly accurate and are robust to varying types of sequencing chemistry. For example, FIG. 5A is an example of chemistry that forms four intensity distributions forming a square when the intensity values are plotted. In contrast, the intensity plot in FIG. 5B has four intensity distributions that fall within a triangle, based on the lesser intensities of the dual-labeled nucleotide. In both types of chemistry, the base calling methods presented herein provide accurate base calls.
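
Choosing the base with the maximum likelihood can be sketched as below. The means, covariances, and the ordering of bases are hypothetical values chosen for illustration; in practice they would come from the fitted distributions.

```python
import numpy as np

# Hypothetical ordering of the four fitted distributions (an assumption).
BASES = ("G", "A", "C", "T")

def likelihoods(points, means, covs):
    """Gaussian density of each point under each of the fitted distributions."""
    out = np.empty((len(points), len(means)))
    for j, (mean, cov) in enumerate(zip(means, covs)):
        d = points - mean
        maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(cov), d)
        out[:, j] = np.exp(-0.5 * maha) / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return out

def call_bases(points, means, covs, bases=BASES):
    """Assign to each point the base whose distribution gives the largest value."""
    return [bases[i] for i in likelihoods(points, means, covs).argmax(axis=1)]

# Illustrative parameters: square layout, identical isotropic covariances.
means = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
covs = np.array([np.eye(2) * 0.01] * 4)
points = np.array([[0.05, -0.02], [0.90, 0.10], [0.10, 0.95], [1.02, 0.97]])

calls = call_bases(points, means, covs)
print(calls)  # ['G', 'A', 'C', 'T']
```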

In embodiments of the methods presented herein, a quality score can also be generated based on the Gaussian distribution approach to base calling. For example, the distance of a point to the center of the “called” distribution gives a measure of the purity of the base call. Specifically, the closer a data point lies to the center of the distribution for the called base, the greater the likelihood that the base call is accurate. Any suitable method to calculate and express the relationship between distance to the center and the likely purity of the base call can be used in the methods provided herein. In some embodiments, the quality or purity of the base call for a given data point can be expressed as the distance to the nearest centroid divided by the sum of all distances to each of the other three centroids. In some embodiments, the quality or purity of the base call for a given data point can be expressed as the distance to the nearest centroid divided by the distance to the second nearest centroid, as described below regarding chastity filtering.

Chastity Filtering

Also presented herein are methods of filtering clusters having poor quality. The term filtering as used in relation to clusters and basecalling refers to discarding or disregarding the cluster as a data point. Thus, any clusters of poor intensity or quality can be filtered and are not included in an output data set. In certain embodiments, cluster quality is determined by a metric termed chastity. Chastity for two-channel basecalling takes on a separate meaning from the use of the term in four-channel basecalling. For example, as described in the incorporated materials of U.S. Patent Application Publication No. 2012/0020537, chastity is defined in terms of intensity of a cluster (“spot”) relative to a nearby spot, and can be calculated as the highest intensity value divided by the sum of the highest intensity value and the second highest intensity value, where the intensity values are obtained from four color channels. However, because two-channel basecalling typically utilizes unlabeled nucleotides that emit very low or no signal, traditional chastity determinations are unsuitable for two-channel basecalling.
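
For reference, the four-channel calculation stated above (highest intensity divided by the sum of the highest and second highest) can be written as a short function. The function name and the intensity values are illustrative assumptions.

```python
import numpy as np

def four_channel_chastity(intensities):
    """Highest intensity divided by (highest + second highest), per cluster."""
    ordered = np.sort(np.asarray(intensities, dtype=float), axis=1)
    highest, second = ordered[:, -1], ordered[:, -2]
    return highest / (highest + second)

# Made-up intensities for two clusters across four color channels.
clusters = np.array([
    [100.0, 20.0, 5.0, 5.0],   # one dominant channel
    [50.0, 50.0, 0.0, 0.0],    # two equally bright channels
])
chastity = four_channel_chastity(clusters)
print(chastity)
```

A cluster with one dominant channel scores close to 1, while two equally bright channels score 0.5. For an unlabeled nucleotide every channel is near zero, so the ratio is driven by noise, which is why this definition does not carry over to two-channel data.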

Thus, some embodiments of the present disclosure relate to determining chastity of a cluster as a function of relative distances to Gaussian centroids. In some embodiments, clusters that are not close enough to one particular Gaussian centroid in a given number of cycles are given a low chastity value and are filtered out. For example, in one specific embodiment, chastity can be calculated using the expression:

chastity=1−D1/(D1+D2),

where D1 is the distance to the nearest Gaussian centroid, and D2 is the distance to the next nearest centroid. Methods of fitting Gaussian distributions to a two-channel data set are described hereinabove in the section describing basecalling methods.
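
The expression can be evaluated as in the following sketch. Euclidean distance is an assumption here, since the text does not fix a particular distance measure, and the centroid positions and test points are illustrative.

```python
import numpy as np

def two_channel_chastity(points, centroids):
    """chastity = 1 - D1 / (D1 + D2), where D1 and D2 are the distances to the
    nearest and next nearest centroid (Euclidean distance is an assumption)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    dists.sort(axis=1)
    d1, d2 = dists[:, 0], dists[:, 1]
    return 1.0 - d1 / (d1 + d2)

# Illustrative centroids in a square layout and three test points.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
points = np.array([
    [0.00, 0.0],   # exactly on a centroid
    [0.25, 0.0],   # closer to one centroid than to the next
    [0.50, 0.0],   # midway between two centroids
])
chastity = two_channel_chastity(points, centroids)
print(chastity)  # [1.   0.75 0.5 ]
```

The value ranges from 0.5 for a point equidistant from its two nearest centroids up to 1 for a point lying on a centroid.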

In some embodiments, filtering of low-chastity clusters takes place at one or more discrete points during a sequencing run. In some embodiments, filtering occurs during template generation. Alternatively or additionally, in some embodiments, filtering occurs after a predefined cycle. In certain embodiments, filtering occurs at or after cycle 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30, or later. In typical embodiments, filtering occurs at cycle 25, such that clusters that are not close enough to a Gaussian centroid in the first 25 cycles are filtered out.
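
A filter applied at cycle 25 might look like the sketch below. The text does not state how "close enough" is decided, so the chastity threshold and the number of tolerated low-chastity cycles are hypothetical parameters chosen purely for illustration.

```python
import numpy as np

def passing_filter(chastity, n_cycles=25, threshold=0.6, max_failures=1):
    """Return a boolean mask of clusters to keep.

    chastity: array of shape (n_clusters, n_total_cycles).
    threshold and max_failures are hypothetical values, not taken from the text.
    """
    window = np.asarray(chastity)[:, :n_cycles]
    failures = (window < threshold).sum(axis=1)
    return failures <= max_failures

# Three made-up clusters over 30 cycles.
chastity = np.full((3, 30), 0.9)
chastity[1, [3, 10]] = 0.4    # two low-chastity cycles inside the first 25
chastity[2, [26, 28]] = 0.4   # low-chastity cycles only after cycle 25

keep = passing_filter(chastity)
print(keep)  # [ True False  True]
```

Only the first 25 cycles are inspected, so the third cluster is kept even though its chastity drops later in the run.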

Sequencing Methods

The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleotide base type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process. Preferred embodiments include sequencing-by-synthesis (“SBS”) techniques.

SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.

SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).

SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).

Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) “Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) “Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Pat. Nos. 6,210,891; 6,258,568 and 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g. A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.

In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04/018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91/06678 and WO 07/123,744, each of which is incorporated herein by reference. The availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.

Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially and an image of the array can be obtained between each addition step. In such embodiments each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features will be present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.

In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators/cleavable fluors can include a fluor linked to the ribose moiety via a 3′ ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3′ allyl group to block extension, but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. Nos. 7,427,673, and 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.

Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199, PCT Publication No. WO 07/010,251, U.S. Patent Application Publication No. 2012/0270305 and U.S. Patent Application Publication No. 2013/0260372, the disclosures of which are incorporated herein by reference in their entireties.

Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and/or second excitation wavelength) and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel (e.g. dGTP having no label).

Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013/0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.

Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features will be present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. Nos. 6,969,488, 6,172,218, and 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.

Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. “Nanopores and nucleic acids: prospects for ultrarapid sequencing.” Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, “Characterization of nucleic acids by nanopore analysis.” Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, “DNA molecules and configurations in a solid-state nanopore microscope.” Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. “Progress toward ultrafast DNA sequencing using solid-state nanopores.” Clin. Chem. 53, 1996-2001 (2007); Healy, K. “Nanopore-based single-molecule DNA analysis.” Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. “A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution.” J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.

Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. Nos. 7,329,492 and 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. “Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al. “Parallel confocal detection of single molecules in real time.” Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures.” Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.

The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids are typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle, or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.

The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², or higher.

Systems

A system capable of carrying out a method set forth herein, whether integrated with detection capabilities or not, can include a system controller that is capable of executing a set of instructions to perform one or more steps of a method, technique or process set forth herein. For example, the instructions can direct the performance of steps for creating a set of amplicons in situ. Optionally, the instructions can further direct the performance of steps for detecting nucleic acids using methods set forth previously herein. A useful system controller may include any processor-based or microprocessor-based system, including systems using microcontrollers, reduced instruction set computers (RISC), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), logic circuits, and any other circuit or processor capable of executing functions described herein. A set of instructions for a system controller may be in the form of a software program. As used herein, the terms “software” and “firmware” are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory. The software may be in various forms such as system software or application software. Further, the software may be in the form of a collection of separate programs, or a program module within a larger program or a portion of a program module. The software also may include modular programming in the form of object-oriented programming.

It will be appreciated that any of the above-described sequencing processes can be incorporated into the methods and/or systems described herein. Furthermore, it will be appreciated that other known sequencing processes can be easily implemented for use with the methods and/or systems described herein. It will also be appreciated that the methods and systems described herein are designed to be applicable to any nucleic acid sequencing technology. Additionally, it will be appreciated that the methods and systems described herein have even wider applicability to any field where tracking and analysis of features in a specimen over time or from different perspectives is important. For example, the methods and systems described herein can be applied where image data obtained by surveillance, aerial or satellite imaging technologies and the like is acquired at different time points or perspectives and analyzed.

EXAMPLES

Example 1

Base Calling Using Gaussian Distribution on 2 Channel Data

Base calling is performed in a 2 channel sequencing system running whole genome sequencing of human samples. After template generation, intensity values are generated for two separate imaging channels. The intensity values are normalized so that the 5th and 95th percentiles occur at 0 and 1, and four Gaussian distributions are fit to the data using an Expectation Maximization algorithm. A centroid (mean X,Y value) for each of the four distributions corresponding to each of the four nucleotides is calculated.
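The normalization step described above can be sketched as follows. This is a minimal illustration only, assuming linear interpolation between ranked values when computing percentiles; the function names `percentile` and `normalize_channel` are hypothetical and do not come from the described system.

```python
def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) using linear interpolation.

    The interpolation rule is an assumption; the text only states which
    percentiles are mapped to 0 and 1.
    """
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def normalize_channel(intensities, low_q=5.0, high_q=95.0):
    """Rescale one channel so its 5th percentile maps to 0 and its 95th to 1."""
    p_low = percentile(intensities, low_q)
    p_high = percentile(intensities, high_q)
    span = p_high - p_low
    if span == 0:
        raise ValueError("channel has no dynamic range between the percentiles")
    return [(value - p_low) / span for value in intensities]


# Example: raw intensities 0..100 in one imaging channel.
raw = [float(v) for v in range(101)]
normalized = normalize_channel(raw)
```

Values outside the 5th to 95th percentile range fall below 0 or above 1 after rescaling; the sketch leaves them unclipped.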

Basecalling for each cluster is performed by calculating likelihood values, each of which is the likelihood that the cluster belongs to one of the four distributions. The centroid associated with the maximum likelihood value is selected as the basecall. This basecall process is performed for each of the clusters in the data set for each cycle.
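The maximum-likelihood selection can be illustrated with a short sketch. It assumes that each fitted distribution is an isotropic two-dimensional Gaussian described by a centroid and a single standard deviation, which is a simplification; the centroid positions, the standard deviation of 0.1, and the assignment of bases to centroids in `EXAMPLE_DISTRIBUTIONS` are hypothetical values chosen for illustration.

```python
import math


def gaussian_log_likelihood(point, centroid, sigma):
    """Log density of an isotropic 2D Gaussian evaluated at `point`."""
    dx = point[0] - centroid[0]
    dy = point[1] - centroid[1]
    return (-(dx * dx + dy * dy) / (2.0 * sigma * sigma)
            - math.log(2.0 * math.pi * sigma * sigma))


def call_base(point, distributions):
    """Return the base with the highest likelihood, plus all four scores.

    `distributions` maps a base label to a (centroid, sigma) pair.
    """
    scores = {
        base: gaussian_log_likelihood(point, centroid, sigma)
        for base, (centroid, sigma) in distributions.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical fitted distributions in normalized (channel 1, channel 2) space.
EXAMPLE_DISTRIBUTIONS = {
    "A": ((1.0, 0.0), 0.1),  # signal in the first channel only
    "C": ((0.0, 1.0), 0.1),  # signal in the second channel only
    "T": ((1.0, 1.0), 0.1),  # signal in both channels
    "G": ((0.0, 0.0), 0.1),  # signal in neither channel
}

base, _ = call_base((0.9, 0.1), EXAMPLE_DISTRIBUTIONS)
```

In practice the distributions would come from the Expectation Maximization fit for the current cycle rather than from fixed values.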

Throughout this application various publications, patents and/or patent applications have been referenced. The disclosure of these publications in their entireties is hereby incorporated by reference in this application.

The term comprising is intended herein to be open-ended, including not only the recited elements, but further encompassing any additional elements.

A number of embodiments have been described. Nevertheless, it will be understood that various modifications may be made. Accordingly, other embodiments are within the scope of the following claims.

FIGS. 6-9 include flowcharts that illustrate one or more methods. FIG. 6 illustrates a method 100 in accordance with an embodiment. The method 100 may be, for example, a method of evaluating the quality of a base call from a sequencing read. The method 100 may include receiving, at 102, a sequencing read having a number of base calls. The method 100 may also include calculating, at 104, a set of predictor values for a base call and using, at 106, the predictor values to look up a quality score (or similar metric) in a quality table (or database).

In one aspect, the sequencing read utilizes two-channel base calling.

In another aspect, the sequencing read utilizes one-channel base calling.

In another aspect, the quality table is generated using Phred scoring on a calibration data set. The calibration set is representative of run and sequence variability. In some embodiments, the method 100 may include generating the quality table.
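As a rough illustration of how a quality table of this kind might be built and queried, the sketch below uses the standard Phred relationship Q = −10·log10(P), where P is the estimated probability that a base call is wrong. Everything else is an assumption made for the example: the predictor values are taken to be already binned into a hashable key, the error rate in each bin is estimated with an add-one pseudocount, and the names `build_quality_table` and `lookup_quality` are hypothetical.

```python
import math
from collections import defaultdict


def phred(error_probability):
    """Standard Phred quality score: Q = -10 * log10(P_error)."""
    return -10.0 * math.log10(error_probability)


def build_quality_table(calibration):
    """Build {predictor_bin: Q} from (predictor_bin, is_error) observations.

    The pseudocount (errors + 1) / (total + 2) is an assumption that keeps
    bins with zero observed errors from yielding an infinite score.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [errors, total]
    for bin_key, is_error in calibration:
        counts[bin_key][0] += int(is_error)
        counts[bin_key][1] += 1
    return {
        bin_key: round(phred((errors + 1) / (total + 2)))
        for bin_key, (errors, total) in counts.items()
    }


def lookup_quality(table, bin_key, default=0):
    """Look up the quality score for a base call's binned predictor values."""
    return table.get(bin_key, default)


# Hypothetical calibration data: one clean bin and one error-prone bin.
calibration = [("high", False)] * 998 + [("low", False)] * 7 + [("low", True)]
table = build_quality_table(calibration)
```

A real calibration set would use bins defined over several predictors at once, so that each combination of predictor values maps to one table entry.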

In another aspect, the predictor values are selected from the group consisting of: online overlap; purity; phasing; start5; hexamer score; motif accumulation; endiness; approximate homopolymer; intensity decay; penultimate chastity; and signal overlap with background (SOWB). In particular embodiments, the set of predictor values comprises online overlap; purity; phasing; and start5. In particular embodiments, the set of predictor values comprises hexamer score; and motif accumulation.

In another aspect, the method also includes the steps of discounting, at 108, unreliable quality scores at the end of each read. The method 100 may also include identifying, at 110, reads where the second worst chastity in the first 25 base calls is below a pre-established threshold and marking the reads as poor quality data.
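The read-level filter at 110 can be sketched as a simple check on per-cycle chastity values. The threshold of 0.6 used below is a hypothetical placeholder for the pre-established threshold, and the function name `is_poor_quality_read` is an assumption; treating reads with fewer than two chastity values as passing is likewise a choice made only for this example.

```python
def is_poor_quality_read(chastities, threshold=0.6, window=25):
    """Flag a read whose second worst chastity in the first `window` calls
    falls below `threshold`.

    `chastities` holds one chastity value per base call, in cycle order.
    """
    early = sorted(chastities[:window])
    if len(early) < 2:
        return False  # assumption: too few calls to evaluate, do not flag
    second_worst = early[1]
    return second_worst < threshold


# One low value in the window is tolerated; two low values flag the read.
one_dip = [0.9] * 10 + [0.3] + [0.9] * 20
two_dips = [0.9] * 10 + [0.3, 0.4] + [0.9] * 20
flags = (is_poor_quality_read(one_dip), is_poor_quality_read(two_dips))
```

Using the second worst value rather than the worst means a single poor cycle early in the read does not by itself cause the read to be marked as poor quality.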

In another aspect, the discounting, at 108, may include using an algorithm to identify a threshold of reliability.

In another aspect, reliable base calls include q-values, or other values indicative of data quality or statistical significance, above the threshold and unreliable base calls comprise q-values, or other values indicative of data quality or statistical significance, below the threshold.

In another aspect, the algorithm comprises an End Anchored Maximal Scoring Segments (EAMSS) algorithm.
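The text does not give the scoring scheme used by the EAMSS algorithm, so the following is only one plausible reading of the name, offered as a sketch. It assumes that each base call contributes a positive score when its quality is below a cutoff and a negative score otherwise, and it searches for the segment anchored at the end of the read with the largest total score. The cutoff, reward, penalty, and replacement quality value are all hypothetical.

```python
def end_anchored_max_segment(qualities, cutoff=15, reward=1, penalty=2):
    """Return the start index of the best-scoring segment that ends at the
    last base call. Returns len(qualities) when no segment scores above 0.
    """
    best_score = 0
    best_start = len(qualities)
    running = 0
    # Walk backward from the read end, accumulating the segment score.
    for i in range(len(qualities) - 1, -1, -1):
        running += reward if qualities[i] < cutoff else -penalty
        if running > best_score:
            best_score = running
            best_start = i
    return best_start


def discount_read_end(qualities, masked_q=2, **scoring):
    """Replace quality scores in the end-anchored segment with `masked_q`."""
    start = end_anchored_max_segment(qualities, **scoring)
    return qualities[:start] + [masked_q] * (len(qualities) - start)


read_q = [35, 34, 36, 30, 12, 10, 9, 8, 5, 3]
discounted = discount_read_end(read_q)
```

With these example values the last six scores are replaced, while a read whose scores stay above the cutoff is returned unchanged.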

In another aspect, the algorithm uses a Hidden Markov Model that identifies shifts in the local distributions of quality scores.

In an embodiment, a system for evaluating the quality of a base call from a sequencing read is provided. The system includes a processor, a storage capacity, and a program for evaluating the quality of a base call from a sequencing read. The program includes instructions for (a) calculating a set of predictor values for the base call and (b) using the predictor values to look up a quality score in a quality table.

In another aspect, the sequencing read utilizes two-channel base calling.

In another aspect, the sequencing read utilizes one-channel base calling.

In another aspect, the quality table is generated using Phred scoring on a calibration data set, the calibration set being representative of run and sequence variability.

In another aspect, the predictor values are selected from the group consisting of: online overlap; purity; phasing; start5; hexamer score; motif accumulation; endiness; approximate homopolymer; intensity decay; penultimate chastity; and signal overlap with background (SOWB). Optionally, the set of predictor values comprises online overlap; purity; phasing; and start5. Optionally, the set of predictor values comprises hexamer score; and motif accumulation.

In another aspect, the program also includes instructions for (c) discounting unreliable quality scores at the end of each read and (d) identifying reads where the second worst chastity in the first 25 base calls is below a pre-established threshold and marking the reads as poor quality data.

In another aspect, step (c) may include using an algorithm to identify a threshold of reliability.

In another aspect, reliable base calls comprise q-values, or other values indicative of data quality or statistical significance, above the threshold and unreliable base calls comprise q-values, or other values indicative of data quality or statistical significance, below the threshold.

In another aspect, the algorithm comprises an End Anchored Maximal Scoring Segments (EAMSS) algorithm.

In another aspect, the algorithm uses a Hidden Markov Model that identifies shifts in the local distributions of quality scores.

FIG. 7 illustrates a method 120 in accordance with an embodiment. The method 120 may include, for example, a method of generating a phasing-corrected intensity value. The method includes (a) performing, at 122, a plurality of cycles of a sequencing by synthesis reaction such that, at each cycle, a signal is generated indicative of incorporation of the same nucleotide into a plurality of identical polynucleotides, whereby a portion of the signal is noise associated with a nucleotide incorporated during a previous cycle. The method also includes (b) detecting, at 124, the signal at each cycle. The signal has an intensity value. The method 120 also includes (c) correcting, at 126, the intensity value for phasing by applying a first order phasing correction to the intensity value, wherein a new first order phasing correction is calculated for each cycle.

In one aspect, the first order phasing correction comprises subtracting an intensity value from the immediately previous cycle from the intensity value of the current cycle.

In another aspect, the method includes subtracting an intensity value from the immediately subsequent cycle from the intensity value of the current cycle.

In another aspect the phasing correction comprises:

$$I_{(\text{cycle})\,\text{corrected}} = I_{(\text{cycle})\,N} - X \cdot I_{(\text{cycle})\,N-1} - Y \cdot I_{(\text{cycle})\,N+1}.$$

In another aspect, the values of X and/or Y are chosen to optimize a chastity determination. Optionally, the chastity determination comprises mean chastity.
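The first order correction can be sketched directly from the formula, in which the corrected intensity at cycle N is the raw intensity at cycle N minus X times the intensity at cycle N−1 and minus Y times the intensity at cycle N+1. Because a new correction is calculated for each cycle, the sketch takes one X and one Y weight per cycle. How the first and last cycles are handled is not stated in the text; dropping the missing neighbor term is an assumption, as are the function name and the example weights.

```python
def phasing_correct(intensities, x_weights, y_weights):
    """Apply I_corrected(N) = I(N) - X(N) * I(N-1) - Y(N) * I(N+1).

    `intensities` holds one raw intensity per cycle for a single feature and
    channel. `x_weights` and `y_weights` hold the per-cycle X and Y values.
    """
    last = len(intensities) - 1
    corrected = []
    for n, value in enumerate(intensities):
        # Assumption: a missing neighbor at either end contributes nothing.
        previous = intensities[n - 1] if n > 0 else 0.0
        following = intensities[n + 1] if n < last else 0.0
        corrected.append(value - x_weights[n] * previous - y_weights[n] * following)
    return corrected


raw = [1.0, 0.5, 0.2]
corrected = phasing_correct(raw, x_weights=[0.1, 0.1, 0.1], y_weights=[0.05, 0.05, 0.05])
```

In a fuller implementation the X and Y values for each cycle could be chosen by trying candidate values and keeping the pair that gives the highest mean chastity for that cycle.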

In another aspect, the sequencing run utilizes two-channel base calling.

In another aspect, the sequencing run utilizes one-channel base calling.

In another aspect, the sequencing run utilizes four-channel base calling.

In an embodiment, a system for generating a phasing-corrected intensity value is provided. The system includes a processor, a storage capacity, and a program for generating a phasing-corrected intensity value. The program includes instructions for (a) performing a plurality of cycles of a sequencing by synthesis reaction such that, at each cycle, a signal is generated indicative of incorporation of the same nucleotide into a plurality of identical polynucleotides, whereby a portion of the signal is noise associated with a nucleotide incorporated during a previous cycle. The program includes instructions for (b) detecting the signal at each cycle, wherein the signal has an intensity value, and (c) correcting the intensity value for phasing by applying a first order phasing correction to the intensity value. A new first order phasing correction is calculated for each cycle.

In one aspect, the first order phasing correction comprises subtracting an intensity value from the immediately previous cycle from the intensity value of the current cycle.

In another aspect, the method includes subtracting an intensity value from the immediately subsequent cycle from the intensity value of the current cycle.

In another aspect, the phasing correction comprises:

$$I_{(\text{cycle})\,\text{corrected}} = I_{(\text{cycle})\,N} - X \cdot I_{(\text{cycle})\,N-1} - Y \cdot I_{(\text{cycle})\,N+1}.$$

In another aspect, the values of X and/or Y are chosen to optimize a chastity determination. Optionally, the chastity determination comprises mean chastity.

In another aspect, the sequencing run utilizes two-channel base calling.

In another aspect, the sequencing run utilizes one-channel base calling.

In another aspect, the sequencing run utilizes four-channel base calling.

FIG. 8 illustrates a method 140 in accordance with an embodiment. The method 140 may be, for example, a method of identifying a nucleotide base. The method 140 includes detecting, at 142, the presence or absence of a signal in two different channels for each of a plurality of features on an array at a particular time, thereby generating a first set of intensity values and a second set of intensity values for each of the features. The combination of intensity values in each of the two channels corresponds to one of four different nucleotide bases. The method also includes, at 144, fitting four Gaussian distributions to the intensity values. Each distribution has a centroid. The method also includes calculating, at 146, a likelihood value that indicates the likelihood of a particular feature belonging to each of the four distributions. The method also includes selecting, at 148, for each feature of said plurality of features the distribution having the highest likelihood value. This distribution corresponds to the identity of the nucleotide base present at the particular feature.

In one aspect, fitting includes using one or more algorithms from the group consisting of: a k-means clustering algorithm, a k-means-like clustering algorithm, an Expectation Maximization algorithm, and a histogram based method. In particular embodiments, fitting includes using an Expectation Maximization algorithm.
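Of the listed options, k-means clustering is the simplest to show in a few lines, so the sketch below uses it to estimate four centroids in the two-channel intensity space. It is a generic k-means illustration and not the Expectation Maximization fit used in particular embodiments. The starting centroids at the four corners of the normalized intensity square, the iteration limit, and the function name are assumptions.

```python
def kmeans_centroids(points, initial_centroids, iterations=20):
    """Estimate cluster centroids for 2D points with plain k-means."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                + (p[1] - centroids[k][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group.
        updated = []
        for centroid, group in zip(centroids, groups):
            if group:
                updated.append((
                    sum(p[0] for p in group) / len(group),
                    sum(p[1] for p in group) / len(group),
                ))
            else:
                updated.append(centroid)  # keep an empty cluster in place
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids


# Hypothetical normalized intensities for eight features.
points = [(0.0, 0.1), (0.1, 0.0), (1.0, 0.1), (0.9, 0.0),
          (0.0, 1.0), (0.1, 0.9), (1.0, 1.0), (0.9, 0.9)]
start = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
centroids = kmeans_centroids(points, start)
```

The resulting centroids could serve as the means of the four distributions, or as starting values for an Expectation Maximization fit that also estimates the spread of each distribution.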

In another aspect, the method includes normalizing the intensity values.

In another aspect, a chastity value is calculated for each feature. The chastity value may be a function of the relative distance from a feature to the two nearest Gaussian centroids.

In another aspect, features having a chastity value below a threshold value are filtered out.
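One function of the relative distance to the two nearest centroids that fits this description is 1 − D1/(D1 + D2), where D1 and D2 are the distances to the nearest and second nearest centroids. The sketch below uses that form as an assumption, together with Euclidean distance and a hypothetical threshold of 0.6 for the filtering step.

```python
import math


def chastity(point, centroids):
    """Return 1 - D1 / (D1 + D2) for the two nearest centroids.

    The value is 1.0 when the feature sits on a centroid and 0.5 when it is
    equidistant from its two nearest centroids.
    """
    distances = sorted(math.dist(point, c) for c in centroids)
    d1, d2 = distances[0], distances[1]
    if d1 + d2 == 0:
        return 0.0  # assumption: coincident centroids give no separation
    return 1.0 - d1 / (d1 + d2)


def filter_features(points, centroids, threshold=0.6):
    """Keep only features whose chastity is at or above the threshold."""
    return [p for p in points if chastity(p, centroids) >= threshold]


corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
kept = filter_features([(1.0, 0.0), (0.5, 0.0), (0.9, 0.1)], corners)
```

Here the feature halfway between two centroids has a chastity of 0.5 and is filtered out, while the two features close to a single centroid are kept.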

In an embodiment, a system for evaluating the quality of a base call from a sequencing read is provided. The system includes a processor, a storage capacity, and a program for identifying a nucleotide base. The program includes instructions for detecting the presence or absence of a signal in two different channels for each of a plurality of features on an array at a particular time, thereby generating a first set of intensity values and a second set of intensity values for each of the features. The combination of intensity values in each of the two channels corresponds to one of four different nucleotide bases. The program also includes instructions for fitting four Gaussian distributions to the intensity values. Each distribution has a centroid. The program also includes instructions for calculating a likelihood value that indicates the likelihood of a particular feature belonging to each of the four distributions and selecting for each feature of said plurality of features the distribution having the highest likelihood value. Said distribution corresponds to the identity of the nucleotide base present at said particular feature.

In one aspect, fitting includes using one or more algorithms from the group consisting of: a k-means clustering algorithm, a k-means-like clustering algorithm, an Expectation Maximization algorithm, and a histogram based method. In particular embodiments, fitting comprises using an Expectation Maximization algorithm.

In another aspect, the program includes instructions for normalizing the intensity values.

In another aspect, the program includes instructions for calculating a chastity value for each feature. The chastity value may be a function of the relative distance from a feature to the two nearest Gaussian centroids. Optionally, features having a chastity value below a threshold value are filtered out.

FIG. 9 illustrates a method 160 in accordance with an embodiment. The method 160 may be, for example, a method of identifying a nucleotide base. The method 160 includes obtaining, at 162, a first set of intensity values and a second set of intensity values for each of a plurality of features on an array. The intensity value for each feature in one or both sets corresponds to the presence or absence of a particular nucleotide base out of four possible nucleotide bases at the feature. The method also includes fitting, at 164, four Gaussian distributions to the intensity values. Each distribution has a centroid. The method also includes calculating, at 166, four likelihood values for each feature, wherein each likelihood value indicates the likelihood of a particular feature belonging to one of the four distributions. The method also includes selecting, at 168, for each feature of said plurality of features the distribution having the highest of the four likelihood values. The distribution corresponds to the identity of the nucleotide base present at the particular feature.

In one aspect, fitting includes using one or more algorithms from the group consisting of: a k-means clustering algorithm, a k-means-like clustering algorithm, an Expectation Maximization algorithm, and a histogram based method. In particular embodiments, fitting includes using an Expectation Maximization algorithm.

In another aspect, the method also includes normalizing the intensity values.

In another aspect, a chastity value is calculated for each feature. The chastity value may be a function of the relative distance from a feature to the two nearest Gaussian centroids. Optionally, features having a chastity value below a threshold value are filtered out.

In an embodiment, a system for evaluating the quality of a base call from a sequencing read is provided. The system includes a processor, a storage capacity, and a program for identifying a nucleotide base. The program includes instructions for obtaining a first set of intensity values and a second set of intensity values for each of a plurality of features on an array. The intensity value for each feature in one or both sets corresponds to the presence or absence of a particular nucleotide base out of four possible nucleotide bases at the feature. The program includes instructions for fitting four Gaussian distributions to the intensity values. Each distribution has a centroid. The program includes instructions for calculating four likelihood values for each feature, wherein each likelihood value indicates the likelihood of a particular feature belonging to one of the four distributions. The program includes instructions for selecting for each feature of said plurality of features the distribution having the highest of the four likelihood values, wherein the distribution corresponds to the identity of the nucleotide base present at the particular feature.

In one aspect, fitting includes using one or more algorithms from the group consisting of: a k-means clustering algorithm, a k-means-like clustering algorithm, an Expectation Maximization algorithm, and a histogram based method. In particular embodiments, fitting includes using an Expectation Maximization algorithm.

In another aspect, the program includes instructions for normalizing the intensity values.

In another aspect, a chastity value is calculated for each feature. Optionally, the chastity value is a function of the relative distance from a feature to the two nearest Gaussian centroids. Optionally, features having a chastity value below a threshold value are filtered out.

FIG. 10 illustrates a system 200 formed in accordance with an embodiment that may be used to carry out various methods set forth herein. For example, the system 200 may be used to carry out one or more of the methods 100 (FIG. 6), 120 (FIG. 7), 140 (FIG. 8), or 160 (FIG. 9). Various steps may be automated by the system 200, such as sequencing, whereas one or more steps may be performed manually or otherwise require user interaction. In particular embodiments, the user may provide a sample (e.g., blood, saliva, hair, semen, etc.) and the system 200 may automatically prepare, sequence, and analyze the sample and provide a genetic profile of the source(s) of the sample. In some embodiments, the system 200 is an integrated standalone system that is located at one site. In other embodiments, one or more components of the system are located remotely with respect to each other.

As shown, the system 200 includes a sample generator 202, a sequencer 204, and a sample analyzer 206. The sample generator 202 may prepare the sample for a designated sequencing protocol. For example, the sample generator may prepare the sample for SBS. The sequencer 204 may conduct the sequencing to generate the sequencing data. As described above, the sequencing data may include a plurality of sequencing reads that include numerous base calls.

The sample analyzer 206 may receive the sequencing data from the sequencer 204. FIG. 10 includes a block diagram of a sample analyzer 206 formed in accordance with one embodiment. The sample analyzer 206 may be used to, for example, analyze sequencing reads to provide base calls. The sample analyzer 206 includes a system controller 212 and a user interface 214. The system controller 212 is communicatively coupled to the user interface 214 and may also be communicatively coupled to the sequencer 204 and/or the sample generator 202.

In an exemplary embodiment, the system controller 212 includes one or more processors/modules configured to process and, optionally, analyze data in accordance with one or more methods set forth herein. For instance, the system controller 212 may include one or more modules configured to execute a set of instructions that are stored in one or more storage elements (e.g., instructions stored on a tangible and/or non-transitory computer readable storage medium, excluding signals) to process the sequencing data. The set of instructions may include various commands that instruct the system controller 212 as a processing machine to perform specific operations such as the workflows, processes, and methods described herein. By way of example, the sample analyzer 206 may be or include a desktop computer, laptop, notebook, tablet computer, or smart phone. The user interface 214 may include hardware, firmware, software, or a combination thereof that enables an individual (e.g., a user) to directly or indirectly control operation of the system controller 212 and the various components thereof.

In the illustrated embodiment, the system controller 212 includes a plurality of modules or sub-modules that control operation of the system controller 212. For example, the system controller 212 may include modules 221-223 and a storage system (or storage capacity) 226 that communicates with at least some of the modules 221-223. The modules 221-223 may be programs in some embodiments. The modules include a phase-correcting module 221, a quality evaluation module 222, and a base identifying module 223. The system 200 may include other modules or sub-modules of the modules that are configured to perform the operations described herein. The phase-correcting module 221 is configured to generate a phasing-corrected intensity value as set forth herein. The quality evaluation module 222 is configured to evaluate the quality of a base call from a sequencing read as set forth herein. The base identifying module 223 is configured to identify a nucleotide base as set forth herein.

As used herein, the terms “module,” “system,” or “system controller” may include a hardware and/or software system and circuitry that operates to perform one or more functions. For example, a module, system, or system controller may include a computer processor, controller, or other logic-based device that performs operations based on instructions stored on a tangible and non-transitory computer readable storage medium, such as a computer memory. Alternatively, a module, system, or system controller may include a hard-wired device that performs operations based on hard-wired logic and circuitry. The module, system, or system controller shown in the attached figures may represent the hardware and circuitry that operates based on software or hardwired instructions, the software that directs hardware to perform the operations, or a combination thereof. The module, system, or system controller can include or represent hardware circuits or circuitry that include and/or are connected with one or more processors, such as one or more computer microprocessors.

As used herein, the terms “software” and “firmware” are interchangeable, and include any computer program stored in memory for execution by a computer, including RAM memory, ROM memory, EPROM memory, EEPROM memory, and non-volatile RAM (NVRAM) memory. The above memory types are exemplary only, and are thus not limiting as to the types of memory usable for storage of a computer program.

In some embodiments, a processing unit, processor, module, or computing system that is “configured to” perform a task or operation may be understood as being particularly structured to perform the task or operation (e.g., having one or more programs or instructions stored thereon or used in conjunction therewith tailored or intended to perform the task or operation, and/or having an arrangement of processing circuitry tailored or intended to perform the task or operation). For the purposes of clarity and the avoidance of doubt, a general purpose computer (which may become “configured to” perform the task or operation if appropriately programmed) is not “configured to” perform a task or operation unless or until specifically programmed or structurally modified to perform the task or operation.

Moreover, the operations of the methods described herein can be sufficiently complex such that the operations cannot be mentally performed by an average human being or a person of ordinary skill in the art within a commercially reasonable time period. For example, the methods may rely on relatively complex computations such that such a person cannot complete the methods within a commercially reasonable time.

What is claimed is:
 1. A system comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to: determine a base call for a nucleotide base of a polynucleotide; generate an online-overlap metric, a purity metric, a phasing metric, and a start5 metric for the base call; determine a quality score for the base call based on the online-overlap metric, the purity metric, the phasing metric, and the start5 metric; and generate output-base-calling data comprising the base call for the nucleotide base and the quality score.
 2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the online-overlap metric for the base call by determining a measure of separation between foreground intensity values corresponding to the nucleotide base and background intensity values.
 3. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the purity metric for the base call by determining a likelihood that the base call for the nucleotide base is reliable based on data for a current sequencing cycle.
 4. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the phasing metric for the base call by determining, for a signal corresponding to the base call, a measure of noise from a previous sequencing cycle and a next sequencing cycle.
 5. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the start5 metric for the base call by determining whether the nucleotide base for the base call is located within an initial set of nucleotide bases of a read or determined during an initial set of sequencing cycles.
 6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine the quality score is above or below a quality-score threshold; categorize the quality score for the base call in a group of quality scores based on the quality score being above or below the quality-score threshold; and generate the output-base-calling data comprising an indicator that the quality score for the base call corresponds to the group of quality scores.
 7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the quality score for the base call from a quality table generated by a Phred algorithm based on online-overlap metrics, purity metrics, phasing metrics, and start5 metrics.
 8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine that the quality score does not satisfy a reliability threshold; and discount the quality score as unreliable based on the quality score not satisfying the reliability threshold.
 9. The system of claim 8, further comprising instructions that, when executed by the at least one processor, cause the system to determine the reliability threshold by identifying transition points between quality scores for reliable base calls and quality scores for unreliable base calls.
 10. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: determine a base call for a nucleotide base of a polynucleotide; generate an online-overlap metric, a purity metric, a phasing metric, and a start5 metric for the base call; determine a quality score for the base call based on the online-overlap metric, the purity metric, the phasing metric, and the start5 metric; and generate output-base-calling data comprising the base call for the nucleotide base and the quality score.
 11. The non-transitory computer-readable medium of claim 10, further comprising instructions that, when executed by the at least one processor, cause the computing device to determine the online-overlap metric for the base call by determining a measure of separation between foreground intensity values corresponding to the nucleotide base and background intensity values.
 12. The non-transitory computer-readable medium of claim 10, further comprising instructions that, when executed by the at least one processor, cause the computing device to determine the purity metric for the base call by determining a likelihood that the base call for the nucleotide base is reliable relative to other non-called nucleotide bases based on data for a current sequencing cycle.
 13. The non-transitory computer-readable medium of claim 10, further comprising instructions that, when executed by the at least one processor, cause the computing device to determine the phasing metric for the base call by determining, for a signal corresponding to the base call, a measure of noise from a previous sequencing cycle and a next sequencing cycle.
 14. The non-transitory computer-readable medium of claim 10, further comprising instructions that, when executed by the at least one processor, cause the computing device to determine the start5 metric for the base call by determining whether the nucleotide base for the base call is located within an initial set of nucleotide bases of a read or determined during an initial set of sequencing cycles.
 15. The non-transitory computer-readable medium of claim 10, further comprising instructions that, when executed by the at least one processor, cause the computing device to: determine the nucleotide base is incorporated at an end of a read for the polynucleotide; determine that the quality score does not satisfy a reliability threshold; and discount the quality score as unreliable based on the nucleotide base being incorporated at the end of the read and the quality score not satisfying the reliability threshold.
 16. A method comprising: determining a base call for a nucleotide base of a polynucleotide; generating an online-overlap metric, a purity metric, a phasing metric, and a start5 metric for the base call; determining a quality score for the base call based on the online-overlap metric, the purity metric, the phasing metric, and the start5 metric; and generating output-base-calling data comprising the base call for the nucleotide base and the quality score.
 17. The method of claim 16, further comprising: determining the quality score is above or below a quality-score threshold; categorizing the quality score for the base call in a group of quality scores based on the quality score being above or below the quality-score threshold; and generating the output-base-calling data comprising an indicator that the quality score for the base call corresponds to the group of quality scores.
 18. The method of claim 16, further comprising determining the quality score for the base call from a quality table generated by a Phred algorithm based on online-overlap metrics, purity metrics, phasing metrics, and start5 metrics.
 19. The method of claim 16, further comprising: determining that the quality score does not satisfy a reliability threshold; and discounting the quality score as unreliable based on the quality score not satisfying the reliability threshold.
 20. The method of claim 19, further comprising determining the reliability threshold by identifying transition points between quality scores for reliable base calls and quality scores for unreliable base calls.
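
By way of non-limiting illustration only, the quality-table lookup and grouping recited above might be sketched as follows. The sketch assumes the four metrics have already been computed for a base call. The bin edges, the table entries, the default error rate, the threshold of 30, and the names `EDGES`, `QUALITY_TABLE`, `quality_score`, and `quality_group` are all hypothetical and are not taken from this disclosure. The only relation used as-is is the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong.

```python
import math

# Hypothetical bin edges for the three continuous metrics (illustrative values only).
EDGES = {
    "online_overlap": [0.5],
    "purity": [0.6, 0.9],
    "phasing": [0.1],
}

# Hypothetical quality table: (overlap_bin, purity_bin, phasing_bin, start5) -> error rate.
# In practice such a table would be calibrated empirically; these entries are made up.
QUALITY_TABLE = {
    (1, 2, 0, False): 0.001,
    (1, 1, 0, False): 0.01,
    (1, 2, 0, True): 0.01,
}
DEFAULT_ERROR_RATE = 0.1  # assumed fallback for combinations missing from the table


def bin_index(value, edges):
    """Count how many bin edges the value meets or exceeds."""
    return sum(value >= edge for edge in edges)


def phred_from_error_rate(p_error):
    """Standard Phred relation: Q = -10 * log10(P_error), rounded to an integer."""
    return round(-10 * math.log10(p_error))


def quality_score(online_overlap, purity, phasing, start5):
    """Look up a quality score from the four metrics of one base call."""
    key = (
        bin_index(online_overlap, EDGES["online_overlap"]),
        bin_index(purity, EDGES["purity"]),
        bin_index(phasing, EDGES["phasing"]),
        bool(start5),
    )
    return phred_from_error_rate(QUALITY_TABLE.get(key, DEFAULT_ERROR_RATE))


def quality_group(q, threshold=30):
    """Categorize a score relative to a quality-score threshold (assumed: >= is 'high')."""
    return "high" if q >= threshold else "low"


if __name__ == "__main__":
    q = quality_score(online_overlap=0.8, purity=0.95, phasing=0.02, start5=False)
    # Output-base-calling data for one base: the call, its score, and its group.
    print({"base": "A", "quality": q, "group": quality_group(q)})
```

In this sketch the group label plays the role of the indicator included in the output-base-calling data. Whether a score exactly at the threshold counts as above or below it is a design choice assumed here.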